Bedilion Declaration

with Response dated 3/31/04

In USSN: 09/831,805

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 

DECLARATION OF TOD BEDILION, Ph.D. 
UNDER 37 C.F.R. § 1.132



I, TOD BEDILION, Ph.D., declare and state as follows:

1. In April, 1996, I became the first employee of 
Synteni, Inc., where I served as Research Director until its 
acquisition by Incyte Corporation in early 1998. After 
Synteni's acquisition, I continued in the position of Director
of Corporate Development at Incyte until May 11, 2001. I am
currently the Director of Business Development at Genomic
Health, Inc., Redwood City, California and an occasional 
Consultant to Incyte. 

2. Synteni was founded to commercialize expression
microarrays, microarrays in which expressed nucleic acids --
full-length cDNAs, fragments of full-length cDNAs, expressed
sequence tags (ESTs) -- are arrayed on a common support to
permit highly parallel detection and measurement of the
expression of their cognate genes in a biological sample.

3. During my employ at Synteni, virtually all (if 
not all) of my work efforts were directed to the further 
technical development and the commercial exploitation of that 
microarray technology; given the small size of our shop, most 
of us had both technical and commercial responsibilities. The 
customer accounts for which I was personally responsible 
included large pharmaceutical companies, such as SmithKline
Beecham, large biotechnology companies, such as Genentech, and
small research institutes, such as DNAX Inc.

4. From my very first interaction with our
customers, consistently through to Synteni's acquisition by
Incyte, I heard uniform, consistent, and emphatic requests
that more genes be added to the arrays. This was true with
respect to both our original microarrays, based on
customer-provided genes and libraries, and our later, "generic", gene
expression microarrays, based upon the UniGene clone
collection (our so-called "UniGEM" arrays). From day 1, the
pressure on us was to print ever more spots on the array. It
was never a question: our customers wanted ever more genes on
the array, each new gene-specific probe providing
incrementally more value to the customer.

5. As a commercial enterprise, providing value to
our customers was our major concern. Thus, to increase the
value of our products and services in the marketplace -- to
increase our ability to sell our microarrays and microarray
services, their "salability" -- our efforts from the very
beginning were devoted to increasing the number of specific
genes whose expression could be detected with our microarrays.

6. Indeed, one of our major competitive advantages
in the marketplace -- not just as regards other commercial
suppliers, but also with respect to the innumerable
laboratories and companies that were attempting to spot arrays
in their own "home-brew" facilities -- was the number of
distinct gene-specific probes that we provided on our
expression microarrays. Our first 10,000 element UniGEM array
put the holy grail of gene expression analysis -- the human
whole genome array -- within sight for the very first time
(with respect to timing of the UniGEM program, we began project
planning and technology development in mid 1996 and delivered
our first 10,000 element standard content human arrays in the
first months of 1997, as I recall).

[... of the encoded gene product ... asking for probes specific to any
and all expressed genes.]

7. By the end of 1997, our efforts to provide the 
most comprehensive, and thus most valuable, human gene 
expression microarrays had been sufficiently successful that 
Incyte agreed to acquire Synteni for a reported $80 million.

8. I declare further that all statements made
herein of my own knowledge are true and that all statements
made on information and belief are believed to be true, and 
further that these statements were made with the knowledge 
that willful false statements and the like so made are 
punishable by fine or imprisonment, or both, under 
Section 1001 of Title 18 of the United States Code and may 
jeopardize the validity of any patent application in which 
this declaration is filed or any patent that issues thereon. 



Tod Bedilion, Ph.D. Date 



Iyer Declaration 

with Response dated 3/31/04 

In USSN: 09/831,805



IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 



DECLARATION OF VISHWANATH R. IYER, Ph.D. 
UNDER 37 C.F.R. § 1.132 



I, VISHWANATH R. IYER, Ph.D., declare and state as follows:

1. I am an Assistant Professor in the Section of 
Molecular Genetics and Microbiology, Institute of Cellular and 
Molecular Biology, University of Texas at Austin, where my 
laboratory currently studies global transcriptional control in 
yeast, gene expression programs during human cell 
proliferation, and genome-wide transcription factor targets in 
yeast and human. Immediately prior to this position, I spent 
four years as a postdoctoral fellow in the laboratory of 
Patrick O. Brown at Stanford University studying the 
transcriptional programs of yeast and of human cells. My 
curriculum vitae is attached hereto as Exhibit A. 

2. Beginning in Dr. Brown's laboratory, where I 
helped to develop the first whole genome arrays for yeast and 
early versions of highly representative cDNA arrays for human 
cells, and continuing to the present day, I have used 
microarray-based gene expression analysis as a principal 
approach in much of my research. 

3. Representative publications describing this
work include:

DeRisi et al., "Exploring the metabolic and
genetic control of gene expression on a genomic
scale," Science 278:680-686 (1997); 1

Marton et al., "Drug target validation and
identification of secondary drug target effects
using DNA microarrays," Nature Med. 4:1293-1301
(1998); 2

Iyer et al., "The transcriptional program in
the response of human fibroblasts to serum,"
Science 283:83-87 (1999); 3 and

Ross et al., "Systematic variation in gene
expression patterns in human cancer cell lines,"
Nature Genetics 24:227-235 (2000). 4

Two of the papers describe our use of microarray-based 
expression profiling to explore the metabolic reprogramming 
that occurs during major environmental changes, both in yeast 
(DeRisi et al., during the shift from fermentation to 
respiration) and in human cells (Iyer et al., human
fibroblasts exposed to serum). One reference describes our
use of expression profile analysis in drug target validation
and identification of secondary drug effects (Marton et al.).
And one describes our use of expression profiling as a
molecular phenotyping tool to discriminate among human cancer 
cells (Ross et al.). 

4. Whether used to elucidate basic physiological 
responses, to study primary and secondary drug effects, or to 
discriminate and classify human cancers, expression profiling
as we have practiced it relies for its power on comparison of
patterns of expression.

5. For example, we have demonstrated that we can 
use the presence or absence of a characteristic drug
"signature" pattern of altered gene expression in drug-treated
cells to explore the mechanism of drug action, and to identify
secondary effects that can signal potentially deleterious drug
side effects. As another example, we have demonstrated that
gene expression patterns can be used to classify human tumor
cell lines. While it is of course advantageous to know the
biological function of the encoded gene products in order to
reach a better understanding of the cellular mechanisms
underlying these results, these pattern-based analyses do not
require knowledge of the biological function of the encoded
proteins.

6. The resolution of the patterns used in such 
comparisons is determined by the number of genes detected: the 
greater the number of genes detected, the higher the 
resolution of the pattern. It goes without saying that higher
resolution patterns are generally more useful in such 
comparisons than lower resolution patterns. With such higher 
resolutions comes a correspondingly higher degree of 
statistical confidence for distinguishing different patterns, 
as well as identifying similar ones. 

7. Each gene included as a probe on a microarray
provides a signal that is specific to the cognate transcript,
at least to a first approximation. 5 Each new gene-specific
probe added to a microarray thus increases the number of genes
detectable by the device, increasing the resolving power of
the device. As I note above, higher resolution patterns are
generally more useful in comparisons than lower resolution
patterns. Accordingly, each new gene probe added to a
microarray increases the usefulness of the device in gene
expression profiling analyses. This proposition is so
well-established as to be virtually an axiom in the art, and has
been as long as I have been working in the field, and
certainly since the time I embarked on the production of whole
genome arrays in early 1996. Simply put, arrays with fewer
gene-specific probes are inferior to arrays with more
gene-specific probes.

5 In a more nuanced view, it is certainly possible for a probe to
signal the presence of a variety of splice variants of a single gene,
without discriminating among them, and for a probe to signal the
presence of a variety of allelic variants of a single gene, again
without discriminating among them.

8. For example, our ability to subdivide cancers
into discriminable classes by expression profiling is limited
by the resolution of the patterns produced. With more genes
contributing to the expression patterns, we can potentially
draw finer distinctions among the patterns, thus subdividing
otherwise indistinguishable cancers into a greater number of
classes; the greater the number of classes, the greater the
likelihood that the cancers classified together will respond
similarly to therapeutic intervention, permitting better
individualization of therapy and, we hope, better treatment
outcomes.

9. If a gene does not change expression in an
experiment, or if a gene is not expressed and produces no
signal in an experiment, that is not to say that the probe
lacks usefulness on the array; it only means that an
insufficient number of conditions have been sampled to
identify expression changes. In fact, an experiment showing
that a gene is not expressed or that its expression level does
not change can be equally informative. To provide maximum
versatility as a research tool, the microarray should
include -- and as a biologist I would want my microarray to
include -- each newly identified gene as a probe.



10. I declare further that all statements made
herein of my own knowledge are true and that all statements
made on information and belief are believed to be true, and
further that these statements were made with the knowledge
that willful false statements and the like so made are
punishable by fine or imprisonment, or both, under
Section 1001 of Title 18 of the United States Code and may
jeopardize the validity of any patent application in which
this declaration is filed or any patent that issues thereon.



Exhibit A of Iyer Declaration
with Response dated 3/31/04
In USSN: 09/831,805



Vishwanath R. Iyer 

Assistant Professor 

Section of Molecular Genetics and Microbiology 

Institute of Cellular and Molecular Biology 

MBB 3.212A, University of Texas at Austin 

Austin, TX 78712-0159 

Phone: 512-232-7833 

Fax: 512-232-3432 

Email: vishy@mail.utexas.edu 

Education/Training 

Bombay University, Mumbai, India B.Sc. (1987), Chemistry & Biochemistry 

M. S. University of Baroda, Baroda, India M.Sc. (1989), Biotechnology 

Harvard University, Cambridge MA Ph.D. (1996), Genetics 

Stanford University, Stanford CA Post-doctoral (1996-2000), Genomics 

Research Experience 

9/00-5/03 Assistant professor, Section of Molecular Genetics and
Microbiology, University of Texas, Austin TX 

■ Global transcriptional control in yeast 

■ Gene expression programs during human cell proliferation 

■ Genome-wide transcription factor targets in yeast and human 

■ Collaborative microarray facility 

5/96-8/00 Post-doctoral fellow Stanford University, Stanford CA 
(Advisor: Dr. Patrick O. Brown)

■ Yeast whole-genome ORF and intergenic microarrays 

■ Human cDNA microarrays for expression profiling 

9/89-4/96 Graduate student Harvard University, Cambridge MA 

(Advisor: Dr. Kevin Struhl) 

■ Yeast transcriptional regulation 



Honours and Awards 

Government of India Biotechnology Fellowship (1987-1989) 
University Grants Commission Junior Research Fellowship (1989) 
Stanford University/NHGRI Genome Training Grant (1996) 

Invited Conference talks (selected) 

Invited Lecturer, NEC-Princeton Lectures in Biophysics 

Princeton, NJ (June 1998) 
Plenary Session Speaker, HGM '99 (HUGO Human Genome Meeting) 

Brisbane, Australia (April 1999) 
Invited Speaker, Gordon Research Conference "Human Molecular Genetics" 

Newport, RI (August 2001)



Invited Speaker, Nature Genetics "Oncogenomics 2002" Conference 

Dublin, Ireland (May 2002) 
Invited Speaker, "Pathology Bioinformatics" Symposium, University of Michigan, 

Ann Arbor, MI (November 2002) 
Invited Speaker, "Systems Biology: Genomic Approaches to Transcriptional 

Regulation" Cold Spring Harbor Laboratory Meeting (March 2003) 
Symposium co-Chair and Speaker "Functional Genomics" American Society for 

Biochemistry and Molecular Biology Meeting, San Diego, CA (April 2003) 
Invited Speaker in Functional Genomics (Gene Networks) Symposium, International 

Congress of Genetics, Melbourne Australia July 6-11 2003 
Invited Speaker "BioArrays Europe 2003" 

Cambridge, UK (Sep/Oct 2003) 

Departmental Seminars 

Texas A&M University Genetics and Biochemistry & Biophysics Departments, 
October 24 2002 

New York University School of Medicine, Department of Biochemistry, 

November 20 2002 
UT Southwestern Medical Center, Human Genetics Seminar Series, 

May 5 2002 

UCLA School of Medicine, Department of Human Genetics 
June 2 2003 

National Human Genome Research Institute 
June 12 2003 

Sanger Institute of the Wellcome Trust, Hinxton, UK 
Sep 2003 

Other Professional Activities 

Reviewer for Genome Biology, Genome Research, Nature Genetics, Science (1998- 
2003) 

Instructor, Cold Spring Harbor Summer Course "Making and using DNA Microarrays" 
(2000 - 2003) 

Member, NIDDK Special Emphasis Review Panel ZDK1 (2001-2002)
Publications 

1. Iyer V. & Struhl K. (1995) Poly(dA:dT), a ubiquitous promoter element that
stimulates transcription via its intrinsic DNA structure. EMBO J. 14: 2570-2579.

2. Iyer V. & Struhl K. (1995) Mechanism of differential utilization of the his3 TR and TC
TATA elements. Mol. Cell Biol. 15: 7059-7066.

3. Iyer V. & Struhl K. (1996) Absolute mRNA levels and transcription initiation rates in
Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. (USA) 93: 5208-5212.

4. DeRisi J. L., Iyer V. R. & Brown P. O. (1997) Exploring the metabolic and genetic
control of gene expression on a genomic scale. Science 278: 680-686

5. Marton M. J., DeRisi J. L., Bennett H. A., Iyer V. R., Meyer M. R., Roberts C. J.,
Stoughton R., Burchard J., Slade D., Dai H., Bassett D. E. Jr., Hartwell L. H., Brown
P. O. & Friend S. H. (1998) Drug target validation and identification of secondary
drug target effects using DNA microarrays. Nature Med. 4: 1293-1301

6. Lutfiyya L. L., Iyer V. R., DeRisi J., DeVit M. J., Brown P. O. & Johnston M. (1998)
Characterization of three related glucose repressors and genes they regulate in
Saccharomyces cerevisiae. Genetics 150: 1377-1391

7. Spellman P. T., Sherlock G., Zhang M. Q., Iyer V. R., Anders K., Eisen M. B., Brown P.
O., Botstein D. & Futcher B. (1998) Comprehensive identification of cell
cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.
Mol. Biol. Cell 9: 3273-3297

8. Iyer V. R., Eisen M. B., Ross D. T., Schuler G., Moore T., Lee J. C. F., Trent J. M.,
Staudt L. M., Hudson Jr. J., Boguski M. S., Lashkari D., Shalon D., Botstein D. &
Brown P. O. (1999) The transcriptional program in the response of human
fibroblasts to serum. Science 283: 83-87

9. DeRisi J. L. & Iyer V. R. (1999) Genomics and array technology. Curr. Opin. Oncol.
11: 76-79

10. Ross D. T., Scherf U., Eisen M. B., Perou C. M., Spellman P., Iyer V. R., Rees C.,
Jeffrey S. S., Van de Rijn M., Waltham M., Pergamenschikov A., Lee J. C. F.,
Lashkari D., Shalon D., Myers T. G., Weinstein J. N., Botstein D. & Brown P. O.
(2000) Systematic variation in gene expression patterns in human cancer cell lines.
Nature Genetics 24: 227-235

11. Sudarsanam P., Iyer V. R., Brown P. O. & Winston F. (2000) Whole-genome
expression analysis of snf/swi mutants of S. cerevisiae. Proc. Natl. Acad. Sci. (USA)
97: 3364-3369

12. Tran H. G., Steger D. J., Iyer V. R. & Johnson A. D. (2000) The chromo domain
protein Chd1p from budding yeast is an ATP-dependent chromatin-modifying factor.
EMBO J. 19: 2323-2331

13. Gross C., Kelleher M., Iyer V. R., Brown P. O. & Winge D. R. (2000) Identification
of the copper regulon in Saccharomyces cerevisiae by DNA microarrays. J. Biol.
Chem. 275: 32310-32316

14. Reid J. L., Iyer V. R., Brown P. O. & Struhl K. (2000) Coordinate regulation of yeast
ribosomal protein genes is associated with targeted recruitment of Esa1 histone
acetylase. Mol. Cell 6: 1297-1307

15. Iyer V. R., Horak C., Scafe C. S., Botstein D., Snyder M. & Brown P. O. (2001)
Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF.
Nature 409: 533-538

16. Miki R., Kadota K., Bono H., Mizuno Y., Tomaru Y., Carninci P., Itoh M., Shibata K.,
Kawai J., Konno H., Watanabe S., Sato K., Tokusumi Y., Kikuchi N., Ishii Y.,
Hamaguchi Y., Nishizuka I., Goto H., Nitanda H., Satomi S., Yoshiki A., Kusakabe
M., DeRisi J. L., Eisen M. B., Iyer V. R., Brown P. O., Muramatsu M., Shimada H.,
Okazaki Y. & Hayashizaki Y. (2001) Delineating developmental and metabolic
pathways in vivo by expression profiling using the RIKEN set of 18,816 full-length
enriched mouse cDNA arrays. Proc. Natl. Acad. Sci. (USA) 98: 2199-2204

17. Pollack J. R. & Iyer V. R. (2002) Characterizing the physical genome. Nature
Genetics 32 suppl: 515-521

18. Iyer V. R. Microarray-based detection of DNA protein interactions: Chromatin
Immunoprecipitation on Microarrays, in DNA Microarrays: A Molecular Cloning
Manual (eds. Bowtell, D. & Sambrook, J.) 453-463 (Cold Spring Harbor Laboratory
Press, 2003).

* (not peer reviewed)

19. Killion P., Sherlock G. & Iyer V. R. (2003) The Longhorn Array Database, an
open-source implementation of the Stanford Microarray Database. BMC
Bioinformatics 4: 32

20. Hahn J. S., Hu Z., Thiele D. J. & Iyer V. R. Genome-Wide Analysis of the Biology of
Stress Responses Through Heat Shock Transcription Factor (submitted to PNAS)

21. Kim J. & Iyer V. R. The global role of TBP recruitment to promoters in mediating
gene expression profiles (manuscript in preparation)



Current/Pending Research Support 

U01 AA13518-01 Adron Harris (PI) 25% effort 

9/28/01 - 9/27/06 

NIH/NIAAA 

"INIA: Microarray Core" 

This proposal was a response to the Integrative Neuroscience Initiative on Alcoholism 

(INIA) RFA-AA-01-002. The overall goal is to support the use of microarray technology 

to define changes in gene expression that either predict or accompany excessive alcohol 

consumption. 

Role: Co-investigator 



003658-0223-2001 Iyer (PI) 16% effort 
01/01/02 - 08/31/04 

Texas Higher Education Coordinating Board (ARP) 

"Microarray based global mapping of DNA-protein interactions at promoters in human 
cells" 

This is a pilot project to map the in vivo interactions of transcription factors with human 

promoters 

Role: PI 



Information Technology Research 0325116 R. Mooney (PI) 9% effort 

09/01/03 - 08/31/07 

NSF 

"Feedback from Multi-Source Data Mining to Experimentation for Gene Network 
Discovery" 

Role: Co-investigator 



1 R01 CA95548-01A2 (pending) Iyer (PI) 25% effort 

12/1/03 - 11/30/08 

NIH 

"Analysis of genome-wide transcriptional control in yeast" 

This is a project to identify stress responsive transcription factor targets in yeast through 
the use of DNA microarrays 
Role: PI 



Breast Cancer Idea Award (pending) Iyer (PI) 10% effort 
1/1/04 - 12/31/06

US Army Medical Research and Materiel Command 

"Genome-wide chromosomal targets of oncogenic transcription factors" 

This is a project aimed at identifying direct chromosomal targets of c-myc and ER in 

human cells through the use of a novel sequence tag analysis method. 

Role: PI 

003658-0531-2003 (pending) Marcotte (PI) 8% effort 
01/01/04-12/31/05 

Texas Higher Education Coordinating Board (ATP) 

"Cell arrays: A novel high-throughput platform for measuring gene function on a 
genomic scale" 

This proposal is aimed at developing a novel microarray based platform for automated, 
high-throughput microscopic imaging of cells, allowing rapid and systematic evaluation 
of gene function.



Exploring the Metabolic and Genetic Control of
Gene Expression on a Genomic Scale

Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown*

DNA microarrays containing virtually every gene of Saccharomyces cerevisiae were used
to carry out a comprehensive investigation of the temporal program of gene expression
accompanying the metabolic shift from fermentation to respiration. The expression
profiles observed for genes with known metabolic functions pointed to features of the
metabolic reprogramming that occur during the diauxic shift, and the expression patterns
of many previously uncharacterized genes provided clues to their possible functions. The
same DNA microarrays were also used to identify genes whose expression was affected
by deletion of the transcriptional co-repressor TUP1 or overexpression of the
transcriptional activator YAP1. These results demonstrate the feasibility and utility of this
approach to genomewide exploration of gene expression patterns.

Department of Biochemistry, Stanford University School of Medicine, Howard Hughes
Medical Institute, Stanford, CA 94305-5428, USA.

*To whom correspondence should be addressed. E-mail: pbrown@cmgm.stanford.edu

The complete sequences of nearly a dozen microbial genomes are known, and in the
next several years we expect to know the complete genome sequences of several
metazoans, including the human genome. Defining the role of each gene in these
genomes will be a formidable task, and understanding how the genome functions as a
whole in the complex natural history of a living organism presents an even greater
challenge.

Knowing when and where a gene is expressed often provides a strong clue as to
its biological role. Conversely, the pattern of genes expressed in a cell can provide
detailed information about its state. Although regulation of protein abundance in
a cell is by no means accomplished solely by regulation of mRNA, virtually all
differences in cell type or state are correlated with changes in the mRNA levels of many
genes. This is fortuitous because the only specific reagent required to measure the
abundance of the mRNA for a specific gene is a cDNA sequence. DNA microarrays,
consisting of thousands of individual gene sequences printed in a high-density
array on a glass microscope slide (1, 2), provide a practical and economical tool
for studying gene expression on a very large scale (3-6).

Saccharomyces cerevisiae is an especially favorable organism in which to conduct a
systematic investigation of gene expression. The genes are easy to recognize in the
genome sequence, cis regulatory elements are generally compact and close to the
transcription units, much is already known about its genetic regulatory mechanisms,
and a powerful set of tools is available for its analysis.

A recurring cycle in the natural history of yeast involves a shift from anaerobic
(fermentation) to aerobic (respiration) metabolism. Inoculation of yeast into a
medium rich in sugar is followed by rapid growth fueled by fermentation, with the
production of ethanol. When the fermentable sugar is exhausted, the yeast cells turn to
ethanol as a carbon source for aerobic growth. This switch from anaerobic growth to
aerobic respiration upon depletion of glucose, referred to as the diauxic shift, is
correlated with widespread changes in the expression of genes involved in fundamental
cellular processes such as carbon metabolism, protein synthesis, and carbohydrate
storage (7). We used DNA microarrays to characterize the changes in gene expression
that take place during this process for nearly the entire genome, and to investigate the
genetic circuitry that regulates and executes this program.

Yeast open reading frames (ORFs) were amplified by the polymerase chain reaction
(PCR), with a commercially available set of primer pairs (8). DNA microarrays,
containing approximately 6400 distinct DNA sequences, were printed onto glass slides by
using a simple robotic printing device (9). Cells from an exponentially growing culture
of yeast were inoculated into fresh medium and grown at 30°C for 21 hours. After an
initial 9 hours of growth, samples were harvested at seven successive 2-hour intervals,
and mRNA was isolated (10). Fluorescently labeled cDNA was prepared by reverse
transcription in the presence of Cy3(green)- or Cy5(red)-labeled deoxyuridine
triphosphate (dUTP) (11) and then hybridized to the microarrays (12). To maximize the
reliability with which changes in expression levels could be discerned, we labeled cDNA
prepared from cells at each successive time point with Cy5, then mixed it with a
Cy3-labeled "reference" cDNA sample prepared from cells harvested at the first interval
after inoculation. In this experimental design, the relative fluorescence intensity
measured for the Cy3 and Cy5 fluors at each array element provides a reliable
measure of the relative abundance of the corresponding mRNA in the two cell
populations (Fig. 1). Data from the series of seven samples (Fig. 2), consisting of more
than 43,000 expression-ratio measurements, were organized into a database to facilitate
efficient exploration and analysis of the results. This database is publicly available
on the Internet (13).

During exponential growth in glucose-rich medium, the global pattern of gene
expression was remarkably stable. Indeed, when gene expression patterns between the
first two cell samples (harvested at a 2-hour interval) were compared, mRNA levels
differed by a factor of 2 or more for only 19 genes (0.3%), and the largest of these
differences was only 2.7-fold (14). However, as glucose was progressively depleted from
the growth media during the course of the experiment, a marked change was seen in the
global pattern of gene expression. mRNA levels for approximately 710 genes were
induced by a factor of at least 2, and the mRNA levels for approximately 1030 genes
declined by a factor of at least 2. Messenger RNA levels for 183 genes increased by a
factor of at least 4, and mRNA levels for 203 genes diminished by a factor of at least
4. About half of these differentially expressed genes have no currently recognized
function and are not yet named. Indeed, more than 400 of the differentially
expressed genes have no apparent homology to any gene whose function is known (15).
The responses of these previously uncharacterized genes to the diauxic shift therefore
provide the first small clue to their possible roles.
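
As a rough illustration of how two-color intensity measurements of this kind translate into counts of induced and repressed genes, the following Python sketch computes a Cy5/Cy3 ratio for each array element and tallies the elements that change by at least a chosen factor. The function names, gene labels, and intensity values are hypothetical and are not data from the study; background subtraction and normalization, which a real analysis would need, are omitted.

```python
from typing import Dict, Tuple


def expression_ratio(cy5: float, cy3: float) -> float:
    """Relative abundance: Cy5 (time-point sample) over Cy3 (reference)."""
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive")
    return cy5 / cy3


def count_changed(ratios: Dict[str, float], factor: float) -> Tuple[int, int]:
    """Return (induced, repressed) counts at the given fold threshold."""
    induced = sum(1 for r in ratios.values() if r >= factor)
    repressed = sum(1 for r in ratios.values() if r <= 1.0 / factor)
    return induced, repressed


# Hypothetical (Cy5, Cy3) intensities for four array elements.
intensities = {
    "geneA": (8000.0, 1000.0),  # 8-fold higher than the reference
    "geneB": (2100.0, 1000.0),  # 2.1-fold higher
    "geneC": (1000.0, 1000.0),  # unchanged
    "geneD": (200.0, 1000.0),   # 5-fold lower
}
ratios = {gene: expression_ratio(cy5, cy3)
          for gene, (cy5, cy3) in intensities.items()}

two_fold = count_changed(ratios, 2.0)   # (2, 1)
four_fold = count_changed(ratios, 4.0)  # (1, 1)
print(two_fold, four_fold)
```

Applying the same tally at thresholds of 2 and 4 to every element of an array gives summary counts of the kind reported above.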

The global view of changes in expression of genes with known functions provides
a vivid picture of the way in which the cell adapts to a changing environment.
Figure 3 shows a portion of the yeast metabolic pathways involved in carbon
and energy metabolism. Mapping the changes we observed in the mRNAs encoding
each enzyme onto this framework allowed us to infer the redirection in the
flow of metabolites through this system. We observed large inductions of the genes
coding for the enzymes aldehyde dehydrogenase (ALD2) and acetyl-coenzyme
A (CoA) synthase (ACS1), which function together to convert the products of
alcohol dehydrogenase into acetyl-CoA, which in turn is used to fuel the
tricarboxylic acid (TCA) cycle and the glyoxylate cycle. The concomitant shutdown of
transcription of the genes encoding pyruvate decarboxylase and induction of pyruvate
carboxylase rechannels pyruvate away from acetaldehyde, and instead to
oxalacetate, where it can serve to supply the TCA cycle and gluconeogenesis.
Induction of the pivotal genes PCK1, encoding phosphoenolpyruvate carboxykinase, and
FBP1, encoding fructose 1,6-biphosphatase, switches the directions of two key
irreversible steps in glycolysis, reversing the flow of metabolites along the
reversible steps of the glycolytic pathway toward the essential biosynthetic precursor,
glucose-6-phosphate. Induction of the genes coding for the trehalose synthase and
glycogen synthase complexes promotes channeling of glucose-6-phosphate into these
carbohydrate storage pathways.

Just as the changes in expression of genes encoding pivotal enzymes can
provide insight into metabolic reprogramming, the behavior of large groups of
functionally related genes can provide a broad view of the systematic way in which the
yeast cell adapts to a changing environment (Fig. 4). Several classes of genes,
such as cytochrome c-related genes and those involved in the TCA/glyoxylate
cycle and carbohydrate storage, were coordinately induced by glucose exhaustion. In
contrast, genes devoted to protein synthesis, including ribosomal proteins, tRNA
synthetases, and translation, elongation, and initiation factors, exhibited a
coordinated decrease in expression. More than 95% of ribosomal genes showed at least
twofold decreases in expression during the diauxic shift (Fig. 4) (13). A noteworthy
and illuminating exception was that the genes encoding mitochondrial ribosomal
genes were generally induced rather than repressed after glucose limitation,
highlighting the requirement for mitochondrial biogenesis (13). As more is learned about
the functions of every gene in the yeast genome, the ability to gain insight into a
cell's response to a changing environment through its global gene expression patterns
will become increasingly powerful.
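
A statement such as "more than 95% of ribosomal genes showed at least twofold decreases" amounts to a per-class tally over expression ratios. The sketch below shows one way such a fraction could be computed; the function name, gene labels, and ratio values are hypothetical placeholders and not data from the study.

```python
from typing import Dict, Iterable


def fraction_repressed(ratios: Dict[str, float],
                       members: Iterable[str],
                       factor: float = 2.0) -> float:
    """Fraction of measured class members whose ratio fell by >= `factor`."""
    measured = [ratios[g] for g in members if g in ratios]
    if not measured:
        raise ValueError("no class members were measured")
    repressed = sum(1 for r in measured if r <= 1.0 / factor)
    return repressed / len(measured)


# Hypothetical ratios (late time point over reference) for a few genes.
ratios = {"rp1": 0.20, "rp2": 0.35, "rp3": 0.45, "rp4": 0.80, "mrp1": 2.5}
ribosomal_class = ["rp1", "rp2", "rp3", "rp4"]  # placeholder class membership

frac = fraction_repressed(ratios, ribosomal_class)
print(frac)  # 0.75: three of the four placeholder genes fall at least twofold
```

The class membership list is the main assumption here: in practice it would come from functional annotation, so the result is only as good as that annotation.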

Several distinct temporal patterns of ex- 
pression could be recognized, and sets of 
genes could be grouped on the basis of the 
similarities in their expression patterns. The 
characterized members of each of these 
groups also shared important similarities in 
their functions. Moreover, in most cases, 
common regulatory mechanisms could be 
inferred for sets of genes with similar expres- 
sion profiles. For example, seven genes 
showed a late induction profile, with mRNA 
levels increasing by more than ninefold at 



the last timepoint but less than threefold at 
the preceding timepoint (Fig. 5B). All of 
these genes were known to be glucose-re- 
pressed, and five of the seven were previously 
noted to share a common upstream activat-
ing sequence (UAS), the carbon source re-
sponse element (CSRE) (16-20). A search
in the promoter regions of the remaining two
genes, ACR1 and IDP2, revealed that
ACR1, a gene essential for ACS1 activity,
also possessed a consensus CSRE motif, but 
interestingly, IDP2 did not. A search of the 
entire yeast genome sequence for the con- 
sensus CSRE motif revealed only four addi- 
tional candidate genes, none of which 
showed a similar induction. 
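
The late-induction criterion described above (more than ninefold induction at the last timepoint, less than threefold at the preceding one) can be written as a simple filter. The Python sketch below is illustrative only: the function name, the data layout, and the fold-change values are hypothetical and are not taken from the published data set.

```python
def late_induced(profiles, last_min=9.0, prev_max=3.0):
    """Return names of genes whose ratio exceeds last_min at the final
    timepoint while staying below prev_max at the preceding timepoint."""
    selected = []
    for gene, ratios in profiles.items():
        if len(ratios) < 2:
            continue  # need at least two timepoints to apply the rule
        if ratios[-1] > last_min and ratios[-2] < prev_max:
            selected.append(gene)
    return sorted(selected)


# Hypothetical fold-change time series (ratio relative to the first timepoint).
example_profiles = {
    "GENE_A": [1.0, 1.1, 1.3, 1.8, 2.5, 12.0],  # late induction
    "GENE_B": [1.0, 1.5, 3.5, 6.0, 8.0, 10.0],  # gradual induction
    "GENE_C": [1.0, 0.9, 0.7, 0.5, 0.3, 0.2],   # repression
}

print(late_induced(example_profiles))
```

Only the first hypothetical gene passes both thresholds; the second is excluded because it is already strongly induced at the preceding timepoint.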

Examples from additional groups of 
genes that shared expression profiles are 
illustrated in Fig. 5, C through F. The 
sequences upstream of the named genes in 
Fig. 5C all contain stress response ele- 
ments (STRE), and with the exception 




Fig. 1. Yeast genome microarray. The actual size of the microarray is 18 mm by 18 mm. The 
microarray was printed as described (9). This image was obtained with the same fluorescent 
scanning confocal microscope used to collect all the data we report (49). A fluorescently labeled
cDNA probe was prepared from mRNA isolated from cells harvested shortly after inoculation (culture
density of <5 × 10^6 cells/ml and media glucose level of 19 g/liter) by reverse transcription in the
presence of Cy3-dUTP. Similarly, a second probe was prepared from mRNA isolated from cells taken
from the same culture 9.5 hours later (culture density of ~2 × 10^8 cells/ml, with a glucose level of
<0.2 g/liter) by reverse transcription in the presence of Cy5-dUTP. In this image, hybridization of the
Cy3-dUTP-labeled cDNA (that is, mRNA expression at the initial timepoint) is represented as a green 
signal, and hybridization of Cy5-dUTP-labeled cDNA (that is, mRNA expression at 9.5 hours) is 
represented as a red signal. Thus, genes induced or repressed after the diauxic shift appear in this 
image as red and green spots, respectively. Genes expressed at roughly equal levels before and after 
the diauxic shift appear in this image as yellow spots. 
of HSP42, have previously been shown to
be controlled at least in part by these
elements (21-24). Inspection of the se-
quences upstream of HSP42 and the two
uncharacterized genes shown in Fig. 5C, 
YKL026c, a hypothetical protein with 
similarity to glutathione peroxidase, and 
YGR043c, a putative transaldolase, re-
vealed that each of these genes also pos-
sesses repeated upstream copies of the stress-
responsive CCCCT motif. Of the 13 ad-
ditional genes in the yeast genome that
shared this expression profile [including
HSP30, ALD2, OM45, and 10 uncharac-
terized ORFs (25)], nine contained one or
more recognizable STRE sites in their up- 
stream regions. 

The heterotrimeric transcriptional acti- 
vator complex HAP2,3,4 has been shown 
to be responsible for induction of several 
genes important for respiration (26-28). 
This complex binds a degenerate consensus 
sequence known as the CCAAT box (26).
Computer analysis, using the consensus se- 
quence TNRYTGGB (29), has suggested 
that a large number of genes involved in 
respiration may be specific targets of 
HAP2,3,4 (30). Indeed, a putative 
HAP2,3,4 binding site could be found in 
the sequences upstream of each of the seven 
cytochrome c-related genes that showed
the greatest magnitude of induction (Fig.
5D). Of 12 additional cytochrome c-related
genes that were induced, HAP2,3,4 binding 
sites were present in all but one. Signifi- 
cantly, we found that transcription of 
HAP4 itself was induced nearly ninefold 
concomitant with the diauxic shift. 
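
A search for a degenerate consensus such as TNRYTGGB can be sketched by expanding the IUPAC nucleotide codes into a regular expression. The Python example below is a minimal illustration, not the procedure used in the cited analysis: scanning both strands is an assumption (the text does not say which strands were examined), and the function names and the test sequence are hypothetical.

```python
import re

# IUPAC nucleotide codes used in degenerate consensus sequences.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(seq, consensus):
    """Return sorted (position, strand) hits of a degenerate consensus.
    Positions are 0-based offsets on the given (forward) sequence."""
    seq = seq.upper()
    # A lookahead lets overlapping matches be reported.
    pattern = re.compile("(?=(" + "".join(IUPAC[c] for c in consensus) + "))")
    hits = [(m.start(), "+") for m in pattern.finditer(seq)]
    rc = reverse_complement(seq)
    for m in pattern.finditer(rc):
        # Convert the reverse-strand offset back to forward coordinates.
        hits.append((len(seq) - m.start() - len(consensus), "-"))
    return sorted(hits)


# Hypothetical upstream fragment containing one forward-strand match.
print(find_motif("AAGTCATTGGCAA", "TNRYTGGB"))
```

Because the consensus is only eight bases long and highly degenerate, matches are expected by chance in any long sequence, so hits from a scan of this kind are best treated as candidates.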

Control of ribosomal protein biogenesis 
is mainly exerted at the transcriptional 
level, through the presence of a common 
upstream-activating element (UAS)
that is recognized by the Rap1 DNA-bind-
ing protein (31, 32). The expression pro-
files of seven ribosomal proteins are shown
in Fig. 5F. A search of the sequences
upstream of all seven genes revealed con-
sensus Rap1-binding motifs (33). It has
been suggested that declining Rap1 levels
in the cell during starvation may be re-
sponsible for the decline in ribosomal pro-
tein gene expression (34). Indeed, we ob-
served that the abundance of RAP1
mRNA diminished by 4.4-fold, at about
the time of glucose exhaustion. 

Of the 149 genes that encode known or
putative transcription factors, only two,
HAP4 and SIP4, were induced by a factor of
more than threefold at the diauxic shift.
SIP4 encodes a DNA-binding transcrip-
tional activator that has been shown to
interact with Snf1, the "master regulator" of
glucose repression (35). The eightfold in- 
duction of SIP4 upon depletion of glucose 
strongly suggests a role in the induction of 



downstream genes at the diauxic shift. 

Although most of the transcriptional 
responses that we observed were not pre- 
viously known, the responses of many 
genes during the diauxic shift have been 
described. Comparison of the results we 
obtained by DNA microarray hybridiza- 
tion with previously reported results there- 
fore provided a strong test of the sensitiv- 
ity and accuracy of this approach. The 
expression patterns we observed for previ- 
ously characterized genes showed almost 
perfect concordance with previously pub- 
lished results (36). Moreover, the differ- 
ential expression measurements obtained 
by DNA microarray hybridization were re- 
producible in duplicate experiments. For 
example, the remarkable changes in gene 
expression between cells harvested imme- 
diately after inoculation and immediately 
after the diauxic shift (the first and sixth 
intervals in this time series) were mea- 
sured in duplicate, independent DNA mi- 
croarray hybridizations. The correlation 
coefficient for two complete sets of expres- 
sion ratio measurements was 0.87, and for 
more than 95% of the genes, the expres- 



sion ratios measured in these duplicate 
experiments differed by less than a factor 
of 2. However, in a few cases, there were 
discrepancies between our results and pre- 
vious results, pointing to technical limita- 
tions that will need to be addressed as 
DNA microarray technology advances 
(37, 38). Despite the noted exceptions, 
the high concordance between the results 
we obtained in these experiments and 
those of previous studies provides confi- 
dence in the reliability and thoroughness 
of the survey. 
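
The two reproducibility measures quoted above (a correlation coefficient between duplicate sets of expression ratios, and the fraction of genes whose duplicate ratios differ by less than a factor of 2) can be computed as in the following Python sketch. Working with log2 ratios is an assumption, since the text does not say whether raw or log-transformed ratios were correlated; the function names and the example ratios are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def compare_duplicates(ratios_a, ratios_b, factor=2.0):
    """Correlation of log2 ratios and the fraction of genes whose
    duplicate ratios differ by less than the given factor."""
    log_a = [math.log2(r) for r in ratios_a]
    log_b = [math.log2(r) for r in ratios_b]
    within = sum(abs(a - b) < math.log2(factor) for a, b in zip(log_a, log_b))
    return pearson(log_a, log_b), within / len(log_a)


# Hypothetical expression ratios for five genes in two replicate hybridizations.
replicate_1 = [1.0, 2.0, 4.0, 0.5, 8.0]
replicate_2 = [1.2, 1.8, 9.0, 0.4, 7.0]

r, fraction = compare_duplicates(replicate_1, replicate_2)
print(f"correlation = {r:.2f}, fraction within twofold = {fraction:.2f}")
```

On a log scale a twofold difference corresponds to a fixed distance of one unit regardless of direction, which is why the comparison is done on log2 values here.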

The changes in gene expression during 
this diauxic shift are complex and involve 
integration of many kinds of information 
about the nutritional and metabolic state 
of the cell. The large number of genes 
whose expression is altered and the diver- 
sity of temporal expression profiles ob- 
served in this experiment highlight the 
challenge of understanding the underlying 
regulatory mechanisms. One approach to 
defining the contributions of individual 
regulatory genes to a complex program of 
this kind is to use DNA microarrays to 
identify genes whose expression is affected 



Fig. 2. The section of the ar- 
ray indicated by the gray box 
in Fig. 1 is shown for each of 
the experiments described 
here. Representative genes 
are labeled. In each of the ar- 
rays used to analyze gene 
expression during the diauxic 
shift, red spots represent
genes that were induced rel- 
ative to the initial timepoint, 
and green spots represent 
genes that were repressed 
relative to the initial timepoint. 
In the arrays used to analyze
the effects of the tup1Δ mu-
tation and YAP1 overexpres-
sion, red spots represent
genes whose expression was 
increased, and green spots 
represent genes whose ex- 
pression was decreased by 
the genetic modification. Note 
that distinct sets of genes are 
induced and repressed in the 
different experiments. The 
complete images of each of 
these arrays can be viewed on 
the Internet (13). Cell density
as measured by optical densi-
ty (OD) at 600 nm was used to
measure the growth of the 
culture. 
by mutations in each putative regulatory
gene. As a test of this strategy, we analyzed
the genomewide changes in gene expression
that result from deletion of the TUP1 gene.
Transcriptional repression of many genes by 
glucose requires the DNA-binding repressor 



Mig1 and is mediated by recruiting the tran-
scriptional co-repressors Tup1 and Cyc8/
Ssn6 (39). Tup1 has also been implicated in
repression of oxygen-regulated, mating-type-
specific, and DNA-damage-inducible genes.



Wild-type yeast cells and cells bearing
a deletion of the TUP1 gene (tup1Δ) were
grown in parallel cultures in rich medium 
containing glucose as the carbon source. 
Messenger RNA was isolated from expo- 
nentially growing cells from the two pop- 
ulations and used to prepare cDNA la- 
beled with Cy3 (green) and Cy5 (red), 
respectively (11). The labeled probes were 
mixed and simultaneously hybridized to 
the microarray. Red spots on the microar- 
ray therefore represented genes whose 
transcription was induced in the tup1Δ
strain, and thus presumably repressed by
Tup1 (41). A representative section of the
microarray (Fig. 2, bottom middle panel)
illustrates that the genes whose expression
was affected by the tup1Δ mutation were,
in general, distinct from those induced 
upon glucose exhaustion [complete images 
of all the arrays shown in Fig. 2 are avail- 
able on the Internet (13)]. Nevertheless, 
34 (10%) of the genes that were induced 
by a factor of at least 2 after the diauxic 
shift were similarly induced by deletion of 
TUP1, suggesting that these genes may be
subject to TUP1-mediated repression by
glucose. For example, SUC2, the gene en- 
coding invertase, and all five hexose trans- 
porter genes that were induced during the 
course of the diauxic shift were similarly 
induced, in duplicate experiments, by the 
deletion of TUP1.

The set of genes affected by Tup1 in this
experiment also included α-glucosidases,
the mating-type-specific genes MFA1 and
MFA2, and the DNA damage-inducible
RNR2 and RNR4, as well as genes involved
in flocculation and many genes of unknown
function. The hybridization signal corre-
sponding to expression of TUP1 itself was
also severely reduced because of the (in-
complete) deletion of the transcription unit
in the tup1Δ strain, providing a positive
control in the experiment (42). 

Many of the transcriptional targets of
Tup1 fell into sets of genes with related
biochemical functions. For instance, al-
though only about 3% of all yeast genes
appeared to be TUP1-repressed by a factor
of more than 2 in duplicate experiments
under these conditions, 6 of the 13 genes
that have been implicated in flocculation
(15) showed a reproducible increase in
expression of at least twofold when TUP1
was deleted. Another group of related
genes that appeared to be subject to TUP1
repression encodes the serine-rich cell
wall mannoproteins, such as Tip1 and
Tir1/Srp1, which are induced by cold
shock and other stresses (43), and similar,
serine-poor proteins, the seripauperins
(44). Messenger RNA levels for 23 of the
26 genes in this group were reproducibly
elevated by at least 2.5-fold in the tup1Δ
Fig. 3. Metabolic reprogramming inferred from global analysis of changes in gene expression. Only key
metabolic intermediates are identified. The yeast genes encoding the enzymes that catalyze each step
in this metabolic circuit are identified by name in the boxes. The genes encoding succinyl-CoA synthase
and glycogen-debranching enzyme have not been explicitly identified, but the ORFs YGR244 and
YPR184 show significant homology to known succinyl-CoA synthase and glycogen-debranching en-
zymes, respectively, and are therefore included in the corresponding steps in this figure. Red boxes with
white lettering identify genes whose expression increases in the diauxic shift. Green boxes with dark 
green lettering identify genes whose expression diminishes in the diauxic shift. The magnitude of 
induction or repression is indicated for these genes. For multimeric enzyme complexes, such as 
succinate dehydrogenase, the indicated fold-induction represents an unweighted average of all the 
genes listed in the box. Black and white boxes indicate no significant differential expression (less than 
twofold). The direction of the arrows connecting reversible enzymatic steps indicate the direction of the 
flow of metabolic intermediates, inferred from the gene expression pattern, after the diauxic shift. Arrows 
representing steps catalyzed by genes whose expression was strongly induced are highlighted in red. 
The broad gray arrows represent major increases in the flow of metabolites after the diauxic shift, 
inferred from the indicated changes in gene expression. 
strain, and 18 of these genes were induced
by more than sevenfold when TUP1 was
deleted. In contrast, none of 83 genes that
could be classified as putative regulators of
the cell division cycle were induced more
than twofold by deletion of TUP1. Thus,
despite the diversity of the regulatory sys-
tems that employ Tup1, most of the genes
that it regulates under these conditions 
fall into a limited number of distinct func- 
tional classes. 
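
The degree of enrichment implied by the flocculation figures (6 of 13 genes induced, against a genome-wide background of about 3%) can be gauged with a binomial tail probability. This calculation is not part of the original analysis, which reports no significance test; the binomial model is an assumption made for illustration, and a hypergeometric test would model sampling from a finite genome more exactly.

```python
from math import comb


def binomial_tail(k, n, p):
    """Probability of observing k or more successes in n independent
    trials, each with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Figures quoted in the text: 6 of 13 flocculation genes were induced,
# against a genome-wide background of about 3%.
p_value = binomial_tail(6, 13, 0.03)
print(f"P(X >= 6 | n=13, p=0.03) = {p_value:.2e}")
```

A very small tail probability under this model would indicate that the clustering of Tup1 targets among flocculation genes is unlikely to arise from the background rate alone.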

Because the microarray allows us to 
monitor expression of nearly every gene in 
yeast, we can, in principle, use this ap- 
proach to identify all the transcriptional 
targets of a regulatory protein like Tup1. It
is important to note, however, that in any 
single experiment of this kind we can only 
recognize those target genes that are nor- 
mally repressed (or induced) under the 
conditions of the experiment. For in-
stance, the experiment described here an-
alyzed a MATα strain in which MFA1
and MFA2, the genes encoding the a-
factor mating pheromone precursor, are
normally repressed. In the isogenic tup1Δ
strain, these genes were inappropriately
expressed, reflecting the role that Tup1
plays in their repression. Had we instead
carried out this experiment with a MATa
strain (in which expression of MFA1 and
MFA2 is not repressed), it would not have
been possible to conclude anything re-
garding the role of Tup1 in the repression
of these genes. Conversely, we cannot dis-
tinguish indirect effects of the chronic
absence of Tup1 in the mutant strain from
effects directly attributable to its partici- 
pation in repressing the transcription of a 
gene. 

Another simple route to modulating the
activity of a regulatory factor is to overex-
press the gene that encodes it. YAP1 en-
codes a DNA-binding transcription factor
belonging to the b-zip class of DNA-bind-
ing proteins. Overexpression of YAP1 in
yeast confers increased resistance to hydro-
gen peroxide, o-phenanthroline, heavy
metals, and osmotic stress (45). We ana-
lyzed differential gene expression between a
wild-type strain bearing a control plasmid
and a strain with a plasmid expressing YAP1
under the control of the strong GAL1-10
promoter, both grown in galactose (that is,
a condition that induces YAP1 overexpres-
sion). Complementary DNA from the con-
trol and YAP1-overexpressing strains, la-
beled with Cy3 and Cy5, respectively, was
prepared from mRNA isolated from the two
strains and hybridized to the microarray.
Thus, red spots on the array represent genes
that were induced in the strain overexpress-
ing YAP1.

Fig. 4. Coordinated regulation of functionally related genes. The curves represent the average
induction or repression ratios for all the genes in each indicated group. The total number of
genes in each group was as follows: ribosomal proteins, 112; translation elongation and initiation
factors, 25; tRNA synthetases (excluding mitochondrial synthetases), 17; glycogen and trehalose syn-

Of the 17 genes whose mRNA levels
increased by more than threefold when
YAP1 was overexpressed in this way, five
bear homology to aryl-alcohol oxidoreduc-
tases (Fig. 2 and Table 1). An additional
four of the genes in this set also belong to
the general class of dehydrogenases/oxi-
doreductases. Very little is known about
the role of aryl-alcohol oxidoreductases in
S. cerevisiae, but these enzymes have been
isolated from ligninolytic fungi, in which
they participate in coupled redox reac-
tions, oxidizing aromatic and aliphatic
unsaturated alcohols to aldehydes with the
production of hydrogen peroxide (46, 47).
The fact that a remarkable fraction of the
targets identified in this experiment be-
long to the same small, functional group of
oxidoreductases suggests that these genes
might play an important protective role
during oxidative stress. Transcription of a
small number of genes was reduced in the
strain overexpressing Yap1. Interestingly,
many of these genes encode sugar per-
meases or enzymes involved in inositol
metabolism.

We searched for Yap1-binding sites
(TTACTAA or TGACTAA) in the se-
quences upstream of the target genes we
identified (48). About two-thirds of the
genes that were induced by more than
threefold upon Yap1 overexpression had
one or more binding sites within 600 bases
upstream of the start codon (Table 1), sug-
gesting that they are directly regulated by
Yap1. The absence of canonical Yap1-bind-
ing sites upstream of the others may reflect
an ability of Yap1 to bind sites that differ
from the canonical binding sites, perhaps in
cooperation with other factors, or less like-
ly, may represent an indirect effect of Yap1
overexpression, mediated by one or more
intermediary factors. Yap1 sites were found
only four times in the corresponding region
of an arbitrary set of 30 genes that were not
differentially regulated by Yap1.
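
The binding-site search described above (TTACTAA or TGACTAA within 600 bases upstream of the start codon) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: counting sites on both strands is an assumption, and the function name and the example sequence are hypothetical.

```python
import re

# Binding sites quoted in the text: TTACTAA or TGACTAA.
YAP1_SITE = re.compile(r"T[TG]ACTAA")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_yap1_sites(upstream, window=600):
    """Count sites on either strand within the last `window` bases of an
    upstream sequence whose final base lies next to the start codon."""
    region = upstream.upper()[-window:]
    rev = region.translate(COMPLEMENT)[::-1]
    return len(YAP1_SITE.findall(region)) + len(YAP1_SITE.findall(rev))


# Hypothetical upstream fragment: one site on the forward strand (TTACTAA)
# and one on the reverse strand (TTAGTCA is the reverse complement of TGACTAA).
example_upstream = "CCTTACTAAGGTTAGTCACC"
print(count_yap1_sites(example_upstream))
```

Neither site is its own reverse complement, so scanning both strands does not count the same occurrence twice.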

Use of a DNA microarray to character- 
ize the transcriptional consequences of 
mutations affecting the activity of regula- 
tory molecules provides a simple and pow- 
erful approach to dissection and character- 
ization of regulatory pathways and net- 



works. This strategy also has an important 
practical application in drug screening. 
Mutations in specific genes encoding can- 
didate drug targets can serve as surrogates 
for the ideal chemical inhibitor or modu- 
lator of their activity. DNA microarrays 
can be used to define the resulting signa- 
ture pattern of alterations in gene expres- 
sion, and then subsequently used in an 
assay to screen for compounds that repro- 
duce the desired signature pattern. 

DNA microarrays provide a simple and 
economical way to explore gene expres- 
sion patterns on a genomic scale. The 
hurdles to extending this approach to any 
other organism are minor. The equipment 



required for fabricating and using DNA 
microarrays (9) consists of components 
that were chosen for their modest cost and 
simplicity. It was feasible for a small group 
to accomplish the amplification of more 
than 6000 genes in about 4 months and, 
once the amplified gene sequences were in 
hand, only 2 days were required to print a 
set of 110 microarrays of 6400 elements 
each. Probe preparation, hybridization, 
and fluorescent imaging are also simple 
procedures. Even conceptually simple ex- 
periments, as we described here, can yield 
vast amounts of information. The value of 
the information from each experiment of 
this kind will progressively increase as 
more is learned about the functions of 
each gene and as additional experiments 
define the global changes in gene expres- 
sion in diverse other natural processes and 
genetic perturbations. Perhaps the greatest 
challenge now is to develop efficient 
methods for organizing, distributing, inter- 
preting, and extracting insights from the 
large volumes of data these experiments 
will provide. 
arraying robot was used to print on a batch of 110
slides. Details of the design of the microarrayer are
available at cmgm.stanford.edu/pbrown. After print-
ing, the microarrays were rehydrated for 30 s in a
humid chamber and then snap-dried for 2 s on a hot
plate (100°C). The DNA was then ultraviolet (UV)
cross-linked to the surface by subjecting the slides to
60 mJ of energy (Stratagene Stratalinker). The rest of
the poly-L-lysine surface was blocked by a 15-min
incubation in a solution of 70 mM succinic anhydride
dissolved in a solution consisting of 315 ml of 1-
methyl-2-pyrrolidinone (Aldrich) and 35 ml of 1 M
boric acid (pH 8.0). Directly after the blocking reac- 
Fig. 5. Distinct temporal patterns of induction or repression help to group genes that share regulatory
properties. (A) Temporal profile of the cell density, as measured by OD at 600 nm and glucose
concentration in the media. (B) Seven genes exhibited a strong induction (greater than ninefold) only at
the last timepoint (20.5 hours). With the exception of IDP2, each of these genes has a CSRE UAS. There
were no additional genes observed to match this profile. (C) Seven members of a class of genes marked
by early induction with a peak in mRNA levels at 18.5 hours. Each of these genes contains STRE motif
repeats in its upstream promoter region. (D) Cytochrome c oxidase and ubiquinol cytochrome c
reductase genes. Marked by an induction coincident with the diauxic shift, each of these genes contains
a consensus binding motif for the HAP2,3,4 protein complex. At least 17 genes shared a similar
expression profile. (E) SAM1, GPP1, and several genes of unknown function are repressed before the
diauxic shift, and continue to be repressed upon entry into stationary phase. (F) Ribosomal protein
genes comprise a large class of genes that are repressed upon depletion of glucose. Each of the genes
profiled here contains one or more RAP1-binding motifs upstream of its promoter. RAP1 is a transcrip-
tional regulator of most ribosomal proteins.
tion, the bound DNA was denatured by a 2-min in-
cubation in distilled water at ~95°C. The slides were
then transferred into a bath of 100% ethanol at room
temperature, rinsed, and then spun dry in a clinical
centrifuge. Slides were stored in a closed box at 
room temperature until used. 

10. YPD medium (8 liters), in a 10-liter fermentation
vessel, was inoculated with 2 ml of a fresh over-
night culture of yeast strain DBY7286 (MATa, ura3,
GAL2). The fermentor was maintained at 30°C with
constant agitation and aeration. The glucose con-
tent of the media was measured with a UV test kit
(Boehringer Mannheim, catalog number 716251).
Cell density was measured by OD at 600-nm wave-
length. Aliquots of culture were rapidly withdrawn
from the fermentation vessel by peristaltic pump,
spun down at room temperature, and then flash
frozen with liquid nitrogen. Frozen cells were stored
at -80°C.

11. Cy3-dUTP or Cy5-dUTP (Amersham) was incorpo-
rated during reverse transcription of 1.25 μg of
polyadenylated [poly(A)+] RNA, primed by a dT(16)
oligomer. This mixture was heated to 70°C for 10
min, and then transferred to ice. A premixed solu-
tion, consisting of 200 U Superscript II (Gibco),
buffer, deoxyribonucleoside triphosphates, and flu-
orescent nucleotides, was added to the RNA. Nu-
cleotides were used at these final concentrations:
500 μM for dATP, dCTP, and dGTP and 200 μM
for dTTP. Cy3-dUTP and Cy5-dUTP were used at
a final concentration of 100 μM. The reaction was
then incubated at 42°C for 2 hours. Unincorporat-
ed fluorescent nucleotides were removed by first
diluting the reaction mixture with 470 μl of 10
mM tris-HCl (pH 8.0)/1 mM EDTA and then subse-
quently concentrating the mix to ~5 μl, using Cen-
tricon-30 microconcentrators (Amicon).

12. Purified, labeled cDNA was resuspended in 11 μl of
3.5× SSC containing 10 μg of poly(dA) and 0.3 μl of
10% SDS. Before hybridization, the solution was
boiled for 2 min and then allowed to cool to room
temperature. The solution was applied to the mi-
croarray under a cover slip, and the slide was
placed in a custom hybridization chamber which
was subsequently incubated for ~8 to 12 hours in
a water bath at 62°C. Before scanning, slides were
washed in 2× SSC, 0.2% SDS for 5 min, and then
0.05× SSC for 1 min. Slides were dried before
scanning by centrifugation at 500 rpm in a Beck-
man CS-6R centrifuge.

13. The complete data set is available on the Internet at 
cmgm.stanford.edu/pbrown/explore/index.html

14. For 95% of all the genes analyzed, the mRNA levels
measured in cells harvested at the first and second
interval after inoculation differed by a factor of less
than 1.5. The correlation coefficient for the compar-
ison between mRNA levels measured for each gene 
in these two different mRNA samples was 0.98. 
When duplicate mRNA preparations from the same 
cell sample were compared in the same way, the 
correlation coefficient between the expression levels 
measured for the two samples by comparative hy- 
bridization was 0.99. 

15. The numbers and identities of known and putative 
genes, and their homologies to other genes, were 
gathered from the following public databases: Sac- 
charomyces Genome Database (genome-www. 
stanford.edu), Yeast Protein Database (quest7. 
proteome.com), and Munich Information Centre for 
Protein Sequences (speedy.mips.biochem.mpg.de/ 
We describe here a method for drug target validation and identification of secondary drug tar-
get effects based on genome-wide gene expression patterns. The method is demonstrated by 
several experiments, including treatment of yeast mutant strains defective in calcineurin, im- 
munophilins or other genes with the immunosuppressants cyclosporin A or FK506. Presence or 
absence of the characteristic drug 'signature' pattern of altered gene expression in drug-treated 
cells with a mutation in the gene encoding a putative target established whether that target was 
required to generate the drug signature. Drug dependent effects were seen in 'targetless' cells, 
showing that FK506 affects additional pathways independent of calcineurin and the im- 
munophilins. The described method permits the direct confirmation of drug targets and recog- 
nition of drug-dependent changes in gene expression that are modulated through pathways 
distinct from the drug's intended target. Such a method may prove useful in improving the effi- 
ciency of drug development programs. 



Good drugs are potent and specific; that is, they must have 
strong effects on a specific biological pathway and minimal ef- 
fects on all other pathways. Confirmation that a compound in- 
hibits the intended target (drug target validation) and the 
identification of undesirable secondary effects are among the 
main challenges in developing new drugs. Comprehensive 
methods that enable researchers to determine which genes or 
activities are affected by a given drug might improve the effi- 
ciency of the drug discovery process by quickly identifying po- 
tential protein targets, or by accelerating the identification of 
compounds likely to be toxic. DNA microarray technology,
which permits simultaneous measurement of the expression 
levels of thousands of genes, provides a comprehensive frame- 
work to determine how a compound affects cellular metabolism 
and regulation on a genomic scale 1 " 1 '. DNA microarrays that 
contain essentially every open reading frame (ORF) in the 
Saccharomyces cerevisiae genome have already been used success- 
fully to explore the changes in gene expression that accompany 
large changes in cellular metabolism or cell cycle progression 710 . 

In the modern drug discovery paradigm, which typically be- 
gins with the selection of a single molecular target, the ideal in- 
hibitory drug is one that inhibits a single gene product so 
completely and so specifically that it is as if the gene product 
were absent. Treating cells with such a drug should induce 
changes in gene expression very similar to those resulting from 
deleting the gene encoding the drug's target. Here we have com- 
pared the genome-wide effects on gene expression that result 
from deletions of various genes in the budding yeast S. cerevisiae
to the effects on gene expression that result from treatment 



with known inhibitors of those gene products. Using the cal- 
cineurin signaling pathway as a model system, we tested an ap- 
proach that permits identification of genes that encode proteins 
specifically involved in pathways affected by a drug. The FK506 
characteristic pattern, or 'signature', of altered gene expression
was not observed in mutant cells lacking proteins inhibited by 
FK506 (for example, a calcineurin or FK506-binding-protein 
mutant strain), but was observed in mutants deleted for genes 
in pathways unrelated to FK506 action (for example, a cy- 
clophilin mutant strain). Conversely, the cyclosporin A (CsA) 
signature was not observed in CsA-treated calcineurin or cy-
clophilin mutant strains, but was seen in an FK506-binding-pro-
tein mutant strain treated with CsA. The method also
demonstrates that FK506, a clinically used immunosuppressant,
has 'off-target' effects that are independent of its binding to im- 
munophilins. Thus, the approach we describe may provide a 
way to identify the pathways altered by a drug and to detect 
drug effects mediated through unintended targets. 

Null mutants phenocopy drug-treated cells on a genomic scale 
To test whether a null mutation in a drug target serves as a 
model of an ideal inhibitory drug, we examined the effects on 
gene expression associated with pharmacological or genetic in- 
hibition of calcineurin function. Calcineurin is a highly con- 
served calcium- and calmodulin-activated serine/threonine 
protein phosphatase implicated in diverse processes dependent 
on calcium signaling 12,13 . In budding yeast, calcineurin is re-
quired for intracellular ion homeostasis 14 , for adaptation to pro-
longed mating pheromone treatment 15 and in the regulation of
Fig. 1 Model of antagonism of the calcineurin signaling pathway mediated
by FK506 and cyclosporin A (CsA). Calcineurin activity is composed of a cat-
alytic subunit (calcineurin A, encoded in yeast by the CNA1 and CNA2 genes),
and calcium-binding regulatory subunits calmodulin (CMD) and calcineurin B
(CnB). After entering cells, FK506 and CsA specifically bind and inhibit the
peptidyl-proline isomerase activity of their respective immunophilins, FK506
binding proteins (FKBP) and cyclophilins (CyP). The most abundant im-
munophilins in yeast (Fpr1 and Cph1) are thought to mediate calcineurin in-
hibition. Drug-immunophilin complexes bind and inhibit the calcium- and
calmodulin-stimulated phosphatase calcineurin. Among the substrates of cal- 
cineurin are transcriptional activators that act to modulate gene expression. 



the onset of mitosis 16 . In mammals, calcineurin has been impli-
cated in T-cell activation 12 , in apoptosis 17 , in cardiac hypertro-
phy 18 and in the transition from short-term to long-term
memory 19 . In both organisms, calcineurin activity is inhibited
by FK506 and CsA, immunosuppressant drugs whose effects on
calcineurin are mediated through families of intracellular recep-
tor proteins called immunophilins 12,20 (Fig. 1). To assess the ef-
fects of pharmacologic inhibition of calcineurin, wild-type S.
cerevisiae was grown to early logarithmic phase in the presence
or absence of FK506 or CsA. Isogenic cells, from which the
genes encoding the catalytic subunits of calcineurin (CNA1 and
CNA2) had been deleted 21 (referred to as the cna or calcineurin 
mutant), were grown in parallel, in the absence of the drug. 
Fluorescently-labeled cDNA was prepared by reverse transcrip-
tion of polyA+ RNA in the presence of Cy3- or Cy5-deoxynu-
cleotide triphosphates and then hybridized to a microarray
containing more than 6,000 DNA probes representing 97% of 
the known or predicted ORFs in the yeast genome. 
Simultaneous hybridization of Cy5-labeled cDNA from mock- 
treated cells and Cy3-labeled cDNA from cells treated with 1 
ug/ml FK506 allowed the effect of drug treatment on mRNA lev- 
els of each ORF to be determined (Fig. 2a and b and data not 
shown). Similarly, effects of the calcineurin mutations on the 
mRNA levels of each gene were assessed by simultaneous hy- 
bridization of Cy5-labeled cDNA from wild-type cells and Cy3- 
labeled cDNA from the calcineurin mutant strain (Fig. 2c). For 
each comparison of this kind, reported expression ratios are the 
average of at least two hybridizations in which the Cy3 and Cy5 
fluors were reversed to remove biases that may be introduced by 
gene-specific differences in incorporation of the two fluors 
(data not shown). 
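
The fluor-reversal averaging described above can be illustrated with a minimal sketch. It assumes that the two hybridizations are combined by averaging in log10 space and that any gene-specific dye bias multiplies the Cy5/Cy3 ratio by the same factor in both hybridizations; the function name and both assumptions are illustrative and are not taken from the published analysis.

```python
import math


def combined_log_ratio(forward_cy5_cy3, reversed_cy5_cy3):
    """Average a fluor-reversed pair of hybridizations in log10 space.

    forward_cy5_cy3:  Cy5/Cy3 intensity ratio with the control sample
                      labeled with Cy5 and the treated sample with Cy3.
    reversed_cy5_cy3: Cy5/Cy3 intensity ratio with the labels swapped.

    Returns the mean log10(control/treated). A multiplicative dye bias
    that is identical in both hybridizations cancels in the average
    (assumption of this sketch).
    """
    forward = math.log10(forward_cy5_cy3)
    # After the swap, control/treated corresponds to Cy3/Cy5, so negate.
    reverse = -math.log10(reversed_cy5_cy3)
    return (forward + reverse) / 2.0


# Hypothetical ORF: true control/treated ratio of 2.0, dye bias of 1.25
# in favor of Cy5. The forward hybridization reads 2.0 * 1.25 = 2.5 and
# the reversed hybridization reads (1 / 2.0) * 1.25 = 0.625.
estimate = combined_log_ratio(2.5, 0.625)
```

In this toy case the bias cancels and the estimate equals log10 of the true ratio, which is the motivation for reversing the fluors.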

Treatment with FK506 in these growth conditions resulted in 
a signature pattern of altered gene expression in which mRNA 
levels of 36 ORFs changed by more than twofold 
(http://www.rosetta.org). A very similar pattern of altered gene 
expression was observed when the calcineurin mutant strain 
was compared to wild-type cells. Comparison of the changes in 
mRNA expression of each gene resulting from treatment of 
wild-type cells with FK506 with mRNA expression changes re- 
sulting from deletion of the calcineurin genes showed the con- 
siderable similarity of the global transcript alterations in 
response to the two perturbations (Fig. 2b-d). Quantification of
this similarity using the correlation coefficient (ρ) showed
large correlations between the FK506 treatment signature and
the calcineurin deletion signature (ρ = 0.75 ± 0.03), as well as
the CsA treatment signature (ρ = 0.94 ± 0.02), but not with a
randomly selected deletion mutant strain (deleted for the
YER071C gene; ρ = -0.07 ± 0.04; Fig. 2e). The FK506 treatment
signature was also compared with those of more than 40 other
deletion mutant strains or drug-treatments thought to affect
unrelated pathways, and none had statistically significant
correlations. These data establish that genetic disruption of
calcineurin function provides a close and specific phenocopy of
treatment with FK506 or CsA.
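
As an illustration of the signature comparison above, the following sketch computes a plain Pearson correlation coefficient between two vectors of log10 expression ratios, one value per ORF in the same order. Whether the reported values used an unweighted or an error-weighted coefficient, and how the ± uncertainties were estimated, is not stated in this section, so the unweighted form and the function name are assumptions.

```python
import math


def signature_correlation(log_ratios_a, log_ratios_b):
    """Pearson correlation between two expression signatures.

    Each argument is a sequence of log10 expression ratios, one per ORF,
    with both sequences listing the ORFs in the same order.
    """
    if len(log_ratios_a) != len(log_ratios_b):
        raise ValueError("signatures must cover the same ORFs")
    n = len(log_ratios_a)
    if n < 2:
        raise ValueError("at least two ORFs are required")
    mean_a = sum(log_ratios_a) / n
    mean_b = sum(log_ratios_b) / n
    cov = sum((a - mean_a) * (b - mean_b)
              for a, b in zip(log_ratios_a, log_ratios_b))
    var_a = sum((a - mean_a) ** 2 for a in log_ratios_a)
    var_b = sum((b - mean_b) ** 2 for b in log_ratios_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a flat signature")
    return cov / math.sqrt(var_a * var_b)
```

A value near 1 indicates that two perturbations shift the same ORFs in the same direction, and a value near 0 indicates unrelated signatures.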

To avoid generalizing from a single example, we also compared
the effects of treatment of wild-type cells with 3-aminotriazole
(3-AT) with the effects of deletion of the HIS3 gene. HIS3
encodes imidazoleglycerol phosphate dehydratase, which catalyzes
the seventh step of the histidine biosynthetic pathway in
yeast 22 ; 3-AT is a competitive inhibitor of this enzyme that
triggers a large transcriptional amino-acid starvation response 23 .
Microarray analysis of wild-type and isogenic his3-deficient
strains demonstrated the expected large genome-wide transcriptional
responses (involving more than 1,000 ORFs) resulting
from treatment with 3-AT (Fig. 3a) or from HIS3 deletion (Fig.
3c). Quantitative comparison of the 3-AT treatment signature
and the his3 mutant signature showed a high level of correlation
(ρ = 0.76 ± 0.02) that even extended to genes that experienced
small changes in expression level (Fig. 3b). As a negative control,
the correlations between the 3-AT treatment signature or the
his3 mutant signature and the calcineurin mutant strain were
not statistically significant (ρ = 0.09 ± 0.06 and -0.01 ± 0.04,
respectively). That both the calcineurin/FK506 and the his3/3-AT
comparisons were highly correlated indicates that in many cases
the expression profile resulting from a gene deletion closely
resembles the expression profile of wild-type cells treated with an
inhibitor of that gene's product.

'Decoder' strategy: Drug target validation with deletion mutants 

Because pharmacological inhibition of different targets might 
give similar or identical expression profiles, simple comparison 
of drug signatures to mutant signatures is unlikely to unambiguously
identify a drug's target. To overcome this limitation, an
additional 'decoder' step is used. We first compare the expression
profile of wild-type drug-treated cells to the expression profiles
from a panel of genetic mutant strains, using a correlation
coefficient metric. Mutant strains whose expression profile is 
similar to that of drug-treated wild-type cells are selected and 
subjected to drug treatment, generating the drug signature in 
the mutant strain (that is, the mutant drug signature). If the 
mutated gene encodes a protein involved in a pathway affected 
by the drug, we expect the drug signature in mutant cells to be 
different (or absent, for an ideal drug) from the drug signature 
seen in wild-type cells. 
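
The two steps of the 'decoder' strategy described above can be sketched as follows. This is a minimal illustration, not the published procedure: the Pearson helper, the function names, the cutoffs of 0.5 and 0.3, and the toy signatures are all assumptions introduced here.

```python
import math


def pearson(a, b):
    """Unweighted Pearson correlation of two equal-length sequences.

    A flat (zero-variance) signature is treated as uncorrelated (0.0),
    a convention chosen here so that an absent drug signature in a
    mutant strain counts as a lost signature.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if sd_a == 0 or sd_b == 0:
        return 0.0
    return cov / (sd_a * sd_b)


def select_candidate_mutants(wt_drug_signature, mutant_signatures, min_rho=0.5):
    """Step 1: keep mutants whose own signature resembles the drug signature."""
    return [name for name, signature in mutant_signatures.items()
            if pearson(wt_drug_signature, signature) >= min_rho]


def drug_signature_lost(wt_drug_signature, mutant_drug_signature, max_rho=0.3):
    """Step 2: True if the drug no longer elicits its wild-type signature."""
    return pearson(wt_drug_signature, mutant_drug_signature) <= max_rho


# Toy log10 expression ratios for five hypothetical ORFs.
wt_plus_drug = [0.9, -0.7, 0.1, 0.5, -0.3]
mutants = {
    "cna": [0.8, -0.6, 0.0, 0.4, -0.2],        # resembles the drug signature
    "yer071c": [0.0, 0.1, -0.1, 0.05, 0.0],    # unrelated deletion
}
candidates = select_candidate_mutants(wt_plus_drug, mutants)
```

A mutant selected in step 1 and showing a lost drug signature in step 2 is consistent with the deleted gene acting in the pathway affected by the drug; the cutoffs would need to be chosen against the measured noise in real data.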
Fig. 2 Expression profiles from FK506-treated wild-type (wt) cells and a
calcineurin-disruption mutant strain share a genome-wide correlation. DNA
microarray analysis showing changes in gene expression resulting from FK506
treatment (a and b) or from genetic disruption of genes encoding
calcineurin (c). a, Pseudo-color image of the results of simultaneous
hybridization of Cy5-labeled cDNA (red) from mock-treated strain R563 and
Cy3-labeled cDNA (green) from strain R563 treated with 1 μg/ml FK506.
b, Enlarged view of the boxed area in a. Arrowheads indicate specific ORFs
induced or repressed. c, Pseudo-color image of the results of simultaneous
hybridization of Cy5-labeled cDNA (red) from strain R563 and Cy3-labeled
cDNA (green) from strain MCY300 (deleted for the CNA1 and CNA2 catalytic
subunits of calcineurin). Arrows indicate specific ORFs induced or
repressed. d, The log10 of the expression ratio for each ORF derived from
the FK506 treatment hybridizations is plotted versus the log10 of the
expression ratio in the calcineurin mutant hybridizations. ORFs that were
induced or repressed in both experiments are shown as green and red dots,
respectively. e, The log10 of the expression ratio for each ORF derived
from the FK506 treatment hybridizations is plotted versus the log10 of the
expression ratio in the yer071c mutant hybridizations. No ORFs were induced
or repressed in both experiments.
[Panel titles: wt -/+ 1 μg/ml FK506; wt vs. calcineurin mutant. Axis labels: log10 (R/G) calcineurin mutation; log10 (R/G) yer071c mutation.]




To illustrate this, we treated the his3 mutant strain with 3- 
AT. The signature pattern of altered gene expression resulting 
from treatment of the mutant strain with 3-AT was much less 
complex than that of the 3-AT signature in wild-type cells (Fig. 
4). This is seen simply by examining plots of mean intensity of 
the hybridization signal (which approximately reflects level of 
expression) versus the expression ratio for each ORF (Fig. 4). 
Genes that were expressed at higher or lower levels in 3-AT 
treated cells or in his3 mutant cells are shown as red and green 
dots, respectively. We analyzed the 3-AT signature in wild-type 
(Fig. 4a) and his3 mutant cells (Fig. 4c), as well as the his3 mu- 
tant strain signature (Fig. 4b). Whereas histidine limitation in- 
duced by 3-AT induced more than 1,000 transcription-level 
changes in the wild-type strain, few or no transcript level 
changes were induced by treatment of the his3 deletion strain
with 3-AT. This indicates that with the growth conditions used, 
essentially all of the effects of 3-AT depend on or are mediated 
through the HIS3 gene product. 

Applying this approach to the calcineurin signaling pathway 
showed the specificity of the method. The calcineurin mutant 
strain and strains with deletions in the genes encoding the 
most abundant immunophilins in yeast 12 (CPH1 and FPR1)
were treated with either FK506 or CsA to determine the profiles 



Table 1 Signature correlation of expression ratios as a result of FK506
treatment in various mutant strains

                       wild-type      cna            fpr1           cna fpr1       cph1
                       +/- FK506      +/- FK506      +/- FK506      +/- FK506      +/- FK506
wild-type +/- FK506    0.93 ± 0.04    -0.01 ± 0.07   -0.23 ± 0.07   0.12 ± 0.07    0.79 ± 0.03

Signature correlation shows the absence of the FK506 signature specifically in the calcineurin (cna) and fpr1
(major FK506 binding protein) deletion mutants. cna represents the mutant with deletions of the catalytic
subunits of calcineurin, CNA1 and CNA2. The correlation coefficient reported in the first column represents the
correlation between two pairs of hybridizations from independent wild-type +/- FK506 experiments.



of altered gene expression resulting from drug treatment of the
mutant cells (that is, mutant +/- drug). We compared the drug
signatures in the mutants to the wild-type drug signature using
the correlation coefficient metric (Table 1). Although the
signature generated by treatment of wild-type cells with FK506 was
highly correlated to the calcineurin mutant strain signature
(ρ = 0.75 ± 0.03), it bore no similarity to the profile after
treatment of the calcineurin mutant strain with FK506
(ρ = -0.01 ± 0.07). This indicates that FK506 was unable to elicit
its normal transcriptional response in the calcineurin mutant strain.
Likewise, treatment of the fpr1 mutant strain with FK506
elicited an expression profile that was not correlated to the
FK506 signature in the wild-type strain (ρ = -0.23 ± 0.07),
indicating that the FPR1 gene product is likely to be involved in the
pathway affected by FK506. The same was true for the cna fpr1
mutant strain. In contrast, treatment of the cph1 mutant strain
with FK506 generated an expression profile highly correlated
with the wild-type FK506 expression profile (ρ = 0.79 ± 0.03),
indicating that the cph1 mutation did not block the mode of action
of FK506 and thus that CPH1 is not directly involved in the pathway
affected by FK506. We tabulated the change in expression in
response to FK506 in different mutant strains for all ORFs with
expression ratios greater than 1.8 in FK506-treated cells or in
the calcineurin mutant strain (Fig. 5a). The
calcineurin mutant strain signature and the
FK506 responses in wild-type and the cph1
mutant strain are similar, and there are no
transcript-level changes (seen in black) for
treatment of the calcineurin, fpr1 and cna
fpr1 mutant strains with FK506 (Fig. 5a).
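
The selection of ORFs for such a tabulation can be sketched as below. The sketch assumes that the 1.8-fold cutoff is applied symmetrically, so that a ratio below 1/1.8 counts as a 1.8-fold repression; that symmetry, the function name and the placeholder ORF labels are assumptions made for illustration.

```python
def signature_genes(ratios_by_gene, min_fold=1.8):
    """Return genes whose expression ratio exceeds min_fold in any experiment.

    ratios_by_gene maps a gene name to a list of positive expression
    ratios, one per reference experiment (for example drug-treated
    wild-type cells and the deletion mutant). Induction (ratio > min_fold)
    and repression (ratio < 1 / min_fold) are treated alike.
    """
    selected = []
    for gene, ratios in ratios_by_gene.items():
        if any(max(r, 1.0 / r) > min_fold for r in ratios):
            selected.append(gene)
    return selected


# Placeholder labels and ratios, not measured data.
example = {
    "ORF_A": [2.0, 1.1],   # induced twofold in the first experiment
    "ORF_B": [1.2, 0.9],   # below the cutoff in both experiments
    "ORF_C": [0.5, 1.0],   # repressed twofold in the first experiment
}
selected = signature_genes(example)
```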

Similar experiments and analyses with CsA 
provided further validation of this approach. 
The expression profile elicited by treatment 
of wild-type cells with CsA was highly corre- 
[Panel title: wt -/+ 10 mM 3-AT.]

Fig. 3 Expression profiles from a his3 mutant strain and wild-type (wt)
cells treated with 3-AT share a genome-wide correlation. DNA microarray
analysis showing changes in gene expression resulting from 3-AT treatment
(a) or from genetic disruption of the HIS3 gene (c). a, Pseudo-color
image of the results of simultaneous hybridization of
Cy5-labeled cDNA (red) from mock-treated wild-type strain R491 and
Cy3-labeled cDNA (green) from strain R491 treated with 10 mM 3-AT.
b, The log10 of the expression ratio for each ORF derived from the
3-AT treatment hybridizations is plotted versus the log10 of the expression
ratio in the his3 mutant hybridizations. ORFs that were induced or
repressed in both experiments are shown as green and red dots,
respectively. The correlation of expression ratios applies not only to
genes with large expression ratios (for example, CHA1 and ARG1), but also
extends to genes with expression ratios less than 2 (for example, ILV1 and
CPH1). ILV1 is induced 1.9-fold and 1.5-fold, and CPH1 is downregulated
1.9-fold and 1.7-fold, in cells treated with 3-AT and his3 mutant cells,
respectively. Two ORFs do not fall on the line x = y. The leftmost point is
the HIS3 data point, which is induced by 3-AT treatment but which is absent
from the his3 mutant strain. The other point is YOR203w. Both data points
are labeled HIS3 because hybridization to YOR203w is most likely due to
HIS3 mRNA, as YOR203w overlaps the HIS3 open reading frame. c, Pseudo-color
image of the results of simultaneous hybridization of Cy5-labeled
cDNA (red) from wild-type strain R491 and Cy3-labeled cDNA (green)
from strain R1226, deleted for the HIS3 gene. Arrowheads indicate
specific ORFs induced or repressed.



lated to the profile elicited by mutation of the calcineurin genes 
(ρ = 0.71 ± 0.04), but did not correlate with the expression
profile resulting from treatment of the calcineurin mutant strain
with CsA (ρ = -0.05 ± 0.07; Table 2), indicating that the genetic
deletion of calcineurin interfered with the ability of CsA to
elicit its normal transcriptional response. Likewise, the CsA
signature was essentially absent in CsA-treated cph1 mutant cells,
and the expression profile of CsA-treated cph1 mutant cells
correlated poorly to that of CsA-treated wild-type cells
(ρ = 0.18 ± 0.07). Thus, the CPH1 gene product was required for the
CsA response seen in wild-type cells. Conversely, treatment of fpr1
mutant cells with CsA resulted in an expression pattern very
similar to the profile of CsA-treated wild-type cells
(ρ = 0.77 ± 0.03), indicating that FPR1 was not necessary for the
CsA-mediated effects. Analysis of individual ORFs affected by CsA and
their expression ratios over the entire set of experiments
confirmed that CPH1 and the genes encoding calcineurin, but not



[Panel title: wt -/+ 10 mM 3-AT. Axis label: log10 (intensity).]

Fig. 4 Treatment of the his3 mutant strain with 3-AT shows nearly
complete loss of 3-AT signature. A plot of the log10 of the mean intensity
of hybridization for each ORF versus the log10 of its expression ratio for
each experiment is shown next to a pseudo-color image of a representative
portion of the microarray. ORFs that are induced or repressed at the 95%
confidence level are shown in green and red, respectively. a, Expression
profile from treatment of the wild-type (wt) strain with 3-AT. Cy5-labeled
cDNA (red) from mock-treated strain R491 and Cy3-labeled cDNA
(green) from strain R491 treated with 10 mM 3-AT. b, Expression profile



FPR1, are necessary for the wild-type CsA response (Fig. 5b). The 
observation that the profiles resulting from FK506 or CsA drug 
treatment are similar to that of the calcineurin deletion mutant 
strain might allow the prediction that calcineurin was involved 
in the pathway affected by these drugs. But because the expression
profile of the fpr1 mutant strain did not bear a strong similarity
to the wild-type drug expression profile for FK506, it is
clear that the drug treatment of the mutant strains was necessary
to identify Fpr1, but not Cph1, as a potential FK506 drug
target. In the same way, the 'decoder' strategy was necessary to
identify Cph1, but not Fpr1, as a potential drug target for CsA.

'Decoder' approach can identify secondary drug effects 

For a drug that has a single biochemical target, the strategy out- 
lined above may be useful in target validation. In many cases, 
however, a compound may affect multiple pathways and elicit 
a very complex signature. 'Decoding' such a complex signature 



[Panel title: his3 mutant 10 mM 3-AT. Axis label: log10 (intensity).]

from the his3 deletion strain. Cy5-labeled cDNA (red) from strain R491
and Cy3-labeled cDNA (green) from strain R1226, deleted for the HIS3
gene. c, Expression profile of treatment of the his3 deletion strain with
3-AT. Cy3-labeled cDNA (red) from his3-deleted strain R1226 and
Cy5-labeled cDNA (green) from strain R1226 treated with 10 mM 3-AT.
Arrowheads indicate the DNA probe and data point corresponding to the
HIS3 gene. The blue dashed line represents the threshold below which
errors tend to increase rapidly because spot intensities are not
sufficiently above background intensity.
Table 2 Signature correlation of expression ratios as a result of CsA
treatment in various mutant strains

                     wild-type      cna            fpr1           cna cph1       cph1
                     +/- CsA        +/- CsA        +/- CsA        +/- CsA        +/- CsA
wild-type +/- CsA    0.94 ± 0.04    -0.05 ± 0.07   0.77 ± 0.03    -0.11 ± 0.07   0.18 ± 0.07

Signature correlation shows the absence of the CsA signature specifically in the calcineurin (cna) and cph1
(cyclophilin) deletion mutants. cna represents the mutant with deletions of the catalytic subunits of
calcineurin, CNA1 and CNA2. The correlation coefficient reported in the first column represents the correlation
between two pairs of hybridizations from independent wild-type CsA experiments.



both the calcineurin and GCN4 pathways. The 
simplest explanation is that FK506 inhibits or 
activates additional pathways. Members of this 
class include SNQ2 and PDR5, genes that encode
drug efflux pumps with structural homology
to mammalian multiple drug resistance
proteins 26 . FK506 may interact directly with 



into the effects mediated through the intended target (the
'on-target' signature) and those mediated through unintended targets
(the 'off-target' signature) might be useful in evaluating a
compound's specificity. Our 'decoder' strategy is based on the
premise that the 'off-target' signature should be insensitive to the
genetic disruption of the primary target.

To determine whether the 'decoder' approach could identify
an 'off-target' profile, we looked for a drug-responsive gene
whose expression is insensitive to deletion of the primary target.
To increase the likelihood of observing such genes, the
same strains described in Tables 1 and 2 were treated with
higher concentrations (50 μg/ml) of FK506. This led to a much
more complex expression profile in wild-type cells, indicating
that at this higher concentration, FK506 was inhibiting or
activating additional targets. Several of the ORFs in this expanded
FK506-induced expression profile were not affected by the
calcineurin, cph1 or fpr1 mutations, as drug treatment of these
mutant strains did not block their presence in the FK506
expression signature (Fig. 6). This indicates that FK506 was
triggering changes in transcript levels of many genes through
pathways independent of calcineurin, CPH1 and FPR1. Many of the
upregulated ORFs in the 'off-target' pathway were genes
reported to be regulated by the transcriptional activator Gcn4
(ref. 24). In some strains, a reporter gene under GCN4 control
was induced in response to FK506 treatment 25 . To determine
whether GCN4 is involved in this pathway that is independent
of calcineurin, CPH1 and FPR1, we analyzed the effects of
treatment with high-dose FK506 on global gene expression in a
strain with a GCN4 deletion (Fig. 6). Of the 41 ORFs with
calcineurin-independent expression ratios greater than 4, 32 were
not induced in the gcn4 mutant, indicating that their induction
by FK506 was GCN4-dependent. Not all GCN4-regulated genes
were induced by FK506. This FK506-induced subset of GCN4-regulated
genes may be those most sensitive to subtle changes
in Gcn4 levels, or perhaps other regulatory circuits prevent
FK506 activation of some GCN4-regulated genes. Seven of the
remaining nine ORFs induced by FK506 were independent of

Fig. 5 Response of FK506 and CsA signature genes in strains with deletions
in different genes. Genes with expression ratios greater than a factor of 1.8 in
response to treatment with 1 μg/ml FK506 (a) or 50 μg/ml CsA (b) are listed
(left side) and their expression ratios in the indicated strain are shown on the
green (induction)-red (repression) color scale. a, Calcineurin (cna) mutant
and FK506 treatment signature genes are in the first two columns. Almost all
FK506 signature genes have expression ratios near unity in deletion strains
involved in pathways affected by FK506 (calcineurin, fpr1 and cna fpr1
mutants) but not in deletion strains in unrelated pathways (cph1). b, Calcineurin
(cna) mutant and CsA treatment signature genes are in the first two
columns. Almost all CsA signature genes have expression ratios near unity in
deletion strains involved in pathways affected by CsA (calcineurin, cph1 and
cna cph1 mutants) but not in deletion strains in unrelated pathways (fpr1).



Pdr5 to inhibit its function 27 . Our results indicate
that treatment with FK506 leads to fourfold-to-sixfold
induction of PDR5 mRNA levels.
YOR1, another gene that can confer drug resis- 
tance, is also induced threefold-to-fourfold by 
FK506. Thus, drug treatment of strains with mutations in the 
primary targets can prove useful in identifying effects mediated 
by secondary drug targets, including the nature and extent of 
newly discovered and previously unsuspected pathways af- 
fected by the drug. 
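
The decoding of a complex signature into 'on-target' and 'off-target' components can be sketched as a per-gene classification over the deletion strains, following the classes named in Fig. 6. The sketch assumes that a gene "responds" when the absolute log10 expression ratio reaches log10(4), and that 'CNA-dependent' genes still respond in the cph1 and gcn4 strains; the cutoff used to call a response, the dictionary keys and the function name are illustrative assumptions.

```python
import math

# Assumed cutoff for calling a response: a fourfold change in either direction.
RESPONSE_THRESHOLD = math.log10(4.0)


def classify_gene(log_ratios, threshold=RESPONSE_THRESHOLD):
    """Assign one gene to a response class from its FK506 log10 ratios.

    log_ratios maps strain labels ('wt', 'cna', 'fpr1', 'cph1', 'gcn4')
    to the log10 expression ratio measured after FK506 treatment of
    that strain.
    """
    responds = {strain: abs(value) >= threshold
                for strain, value in log_ratios.items()}
    if not responds["wt"]:
        return "no response"
    cna_pathway = (responds["cna"], responds["fpr1"])
    if not any(cna_pathway) and responds["cph1"] and responds["gcn4"]:
        return "CNA-dependent"
    if all(cna_pathway) and responds["cph1"] and not responds["gcn4"]:
        return "GCN4-dependent"
    if all(responds.values()):
        return "CNA- and GCN4-independent"
    return "complex behavior"
```

Genes in the first class make up the 'on-target' signature under these assumptions, and the remaining responsive classes make up the 'off-target' signature.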

We describe here a method for drug target validation and the 
identification of secondary drug target effects that uses DNA mi- 
croarrays to survey the effects of drugs on global gene expres- 
sion patterns. We established that genetic and pharmacologic 
inhibition of gene function can result in extremely similar 
changes in gene expression. We also demonstrated that one can 
confirm a potential drug target by treating a deletion mutant 
defective in the gene encoding the putative target. Drug-medi- 
ated signatures from strains with mutations in pathways or 
processes directly or indirectly affected by the drug bore little or 



no similarity to the wild-type drug expression profile. In contrast,
drug-mediated signatures from strains with mutations in
genes involved in pathways unrelated to the drug's action
showed extensive similarity to the wild-type drug signature. By
applying this approach to a drug that affects multiple pathways
(FK506), we were able to decode a complex signature into component
parts, including the identification of an 'off-target' signature
that was mediated through pathways independent of
calcineurin or the Fpr1 immunophilin.

Discussion 

It is well-established that high-throughput biochemical screen- 
ing can identify potent inhibitory compounds against a given 
target. The 'decoder' approach described here complements 
this process by evaluating the equally important property of 
specificity: the tendency of a compound to inhibit pathways 
other than that of its intended target. The ability to observe 
such 'off-target' effects will likely be useful in several ways.
Profiling compounds with known toxicities will allow the de- 
velopment of a database of expression changes associated with 
particular toxicities. Recognition of potential toxicities in the 
'off-target' signatures of otherwise promising compounds then 
may allow earlier identification of those likely to fail in clinical 
trials. Comparing the extent and peculiarities of 'off-target'
signatures of promising drug candidates could provide a new way
to group compounds by their effects on secondary pathways, 
even before those effects are understood. This may prove to be 
an alternative, potentially more effective, way to select com- 
pounds for animal and clinical trials. Some drugs are more ef- 
fective against a related protein than against the originally 
intended target. Sildenafil (Viagra™), for example, was initially 
developed as a phosphodiesterase inhibitor to control cardiac 
contractility, but was found to be highly specific for phospho- 
diesterase 5, an isozyme whose inhibition overcomes defects in 



Fig. 6 Response of FK506 signature genes in strains with deletions
in different genes. Genes with expression ratios greater than a factor
of 4 in at least one experiment are listed and their expression ratios in
the indicated strain are shown in the green (induction)-red (repression)
color scale. The genes have been divided into classes corresponding
to these expected behaviors: 'CNA-dependent' genes
respond to FK506 (50 μg/ml) except when either calcineurin genes or
FPR1 or both are deleted; 'GCN4-dependent' genes respond to FK506
except when GCN4 is deleted. These genes still respond to FK506
when calcineurin genes or FPR1 or CPH1 are deleted; that is, their
responses are not mediated by calcineurin, Cph1, or Fpr1. 'CNA- and
GCN4-independent' genes respond to FK506 in all deletion strains
tested. A 'complex behavior' class is provided for those genes that did
not match the model of FK506 response mediated through
calcineurin or Fpr1 or separately through Gcn4.



penile erection. It is possible that application of the
'decoder' to other compounds may show that they too have a
potent activity against a target distinct from their in- 
tended target. 

The ability to decode drug effects is dependent on the
availability of functionally 'targetless' cells. In yeast, this
is being achieved by systematically disrupting each yeast
gene (Saccharomyces Deletion Consortium:
http://sequence-www.stanford.edu/group/yeast_deletion_project/deletion.html).
Efforts are underway to obtain
expression profiles from each deletion mutant strain.
Determining signatures resulting from inactivation of essential
genes presents a unique problem, but it may be
possible to do so by examining heterozygotes or by using a
controllable promoter to reduce expression of the essential gene.
Although it is already feasible to test several compounds in
dozens of yeast strains, another challenge for the 'decoder'
strategy will be the efficient selection of the mutants with
deletions in genes most likely to encode the intended drug target.
The signature correlation plots described are one metric that
could be used as part of that selection process, but others need
to be explored. Applying the 'decoder' to mammalian cells
presents additional challenges. It is considerably more difficult to
isolate functionally 'targetless' cells. Strategies involving
titratable promoters, known specific inhibitors, anti-sense RNAs,
ribozymes, and methods of targeting specific proteins for
degradation are possible and should be tested. Another limitation
is that not all cell types express the same set of genes and
therefore 'off-target' effects may be different in different cell
types. In addition, applying the 'decoder' to human cells will
also require technical improvements that allow expression
profiling from a small number of cells. Even the broader question
of whether the insensitivity of 'off-target' signatures to the
disruption of the main target is the exception or the rule can only
be answered by the accumulation of more data. Barkai and
Leibler, however, have argued in favor of robustness of biological
networks, indicating that drug perturbations ('off-target'
signatures) may be robust even when the system is subjected to
another perturbation (such as a genetic disruption) (ref. 28).
Many practical developments will be necessary if the 'decoder'
concept is to be broadly applied.

Expression arrays have been used mainly as an initial screen 
for genes induced in a particular tissue or process of interest by 
focusing on genes with large expression ratios. We have 
found, however, that effort to refine experimental protocols 
and repeat experiments increases the reliability of the data and 
permits new applications. For example, it provides a larger set 



NATURE MEDICINE • VOLUME 4 • NUMBER 11 • NOVEMBER 1998

© 1998 Nature America Inc. • http://medicine.nature.com



Table 3 Yeast strains used

Strain    Relevant genotype    Reference
YPH499    [genotype illegible in source]    (34)
R563      [genotype illegible in source]    (this study)
R558      [genotype illegible in source]    (this study)
R567      [genotype illegible in source]
MCY300    Mata ura3-52 lys2-801 ade2-101 trp1-Δ63 his3-Δ200 leu2-Δ1 cna1Δ1::hisG cna2Δ1::HIS3    (21)
R132      Mata ura3-52 lys2-801 ade2-101 trp1-Δ63 his3-Δ200 leu2-Δ1 cna1Δ1::hisG cna2Δ1::HIS3 cph1::kanr    (this study)
R133      Mata ura3-52 lys2-801 ade2-101 trp1-Δ63 his3-Δ200 leu2-Δ1 cna1Δ1::hisG cna2Δ1::HIS3 fpr1::kanr    (this study)
R559      Mata ura3-52 lys2-801 ade2-101 trp1-Δ63 his3-Δ200 leu2-Δ1 his3::HIS3 gcn4::LEU2    (this study)
BY4719    Mata trp1-Δ63 ura3-Δ0    (35)
BY4738    Matα trp1-Δ63 ura3-Δ0    (35)
R491      Mata/α BY4719 × BY4738    (this study)
BY4728    Mata his3-Δ200 trp1-Δ63 ura3-Δ0    (35)
BY4729    Matα his3-Δ200 trp1-Δ63 ura3-Δ0    (35)
R1226     Mata/α BY4728 × BY4729    (this study)




of genes at higher confidence levels that serve as a more 
unique signature for a given protein perturbation. In addition, 
it allows subtle signatures to be detected, when, for example, a 
protein is only partially inhibited. This may enable clinical 
monitoring of small changes in protein function in disease or 
toxicity states before they could otherwise be detected. 
Because the functions of many genes detected on transcript ar- 
rays are known, these microarrays are powerful tools that pro- 
vide detailed information about a cell's physiology. For 
example, changes in the flux through a metabolic pathway are 
reflected in transcriptional changes in genes in the pathway 7 . 
Furthermore, it may be possible to indirectly measure protein 
activity levels from expression profiling data (S.F. et al.,
unpublished data). Thus, although the eventual development of
genomic methods allowing the direct measurement of all cel- 
lular protein levels will be an important achievement, tran- 
script array technology offers an immediate and robust means 
of evaluating the effects of various treatments on gene expres- 
sion and protein function. 

Methods 

Construction, growth and drug treatment of yeast strains. The strains
used in this study (Table 3) were constructed by standard techniques 29 .
To construct strain R559, strain R563 was transformed to Leu+ with
plasmid pM12 digested by SalI and MluI (provided by A. Hinnebusch and T.
Dever). Strains R132 and R133 were constructed by transforming the
bacterial kanamycin resistance cassette 30 flanked by genomic DNA from the
CPH1 and FPR1 loci, respectively, and selecting for G418-resistant
colonies. For experiments with FK506, cells were grown for three
generations to a density of 1 × 10^7 cells/ml in YAPD medium (YPD plus 0.004%
adenine) supplemented with 10 mM calcium chloride as described 31 .
Where indicated, FK506 was added to a final concentration of 1 μg/ml
0.5 h after inoculation of the culture or to 50 μg/ml 1 h before cells were
collected. CsA was used at a final concentration of 50 μg/ml. Cells were
broken by standard procedures with the following modifications: Cell
pellets were resuspended in breaking buffer (0.2 M Tris-HCl pH 7.6, 0.5 M
NaCl, 10 mM EDTA, 1% SDS), vortexed for 2 min on a VWR multi-tube
vortexer at setting 8 in the presence of 60% glass beads (425-600 μm
mesh; Sigma) and phenol:chloroform (50:50, volume/volume). After
separation of the phases, the aqueous phase was re-extracted and
ethanol-precipitated. PolyA+ RNA was isolated by two sequential
chromatographic purifications over oligo dT cellulose (New England
Biolabs, Beverly, Massachusetts) using established protocols.

For experiments using 3-AT, wild-type or his3/his3 cells were grown to
early logarithmic phase in SC medium, pelleted and resuspended in SC 
medium lacking histidine for 1 hr in the presence or absence of 10 mM 3- 



AT, as indicated. Cells were harvested and mRNA isolated as above. 
FK506 was obtained from the Swedish Hospital Pharmacy (Seattle, 
Washington) and purified to homogeneity by ethyl acetate extraction by 
J. Simon (Fred Hutchinson Cancer Research Center, Seattle, Washington). 
CsA was obtained from Alexis Biochemicals (San Diego, California); 3-AT 
was from Sigma. 

Preparation and hybridization of the labeled sample. Fluorescently
labeled cDNA was prepared, purified and hybridized essentially as
described 7. Cy3- or Cy5-dUTP (Amersham) was incorporated into cDNA
during reverse transcription (Superscript II; Life Technologies), and the
labeled cDNA was purified by concentrating to less than 10 µl using
Microcon-30 microconcentrators (Amicon, Houston, Texas). Paired cDNAs
were resuspended in 20-26 µl hybridization solution (3x SSC, 0.75 µg/ml
poly A DNA, 0.2% SDS) and applied to the microarray under a 22- x 30-mm
coverslip for 6 h at 63 °C, all according to a published method.

Fabrication and scanning of microarrays. PCR products containing
common 5' and 3' sequences (Research Genetics, Huntsville, Alabama)
were used as templates with amino-modified forward primer and unmodified
reverse primers to PCR amplify 6,065 ORFs from the S. cerevisiae
genome. Our first-pass success rate was 94%. Amplification reactions that
gave products of unexpected sizes were excluded from subsequent analysis.
ORFs that could not be amplified from purchased templates were
amplified from genomic DNA. DNA samples from 100-µl reactions were
isopropanol-precipitated, resuspended in water, brought to a final
concentration of 3x SSC in a total volume of 15 µl, and transferred to
384-well microtiter plates (Genetix Limited, Christchurch, Dorset, England).
PCR products were spotted onto 1 x 3-inch polylysine-treated glass slides
by a robot built essentially according to defined specifications
(http://cmgm.stanford.edu/pbrown/MGuide). After being printed, slides
were processed according to published protocols.

Microarrays were imaged on a prototype multi-frame CCD camera in
development at Applied Precision (Issaquah, Washington). Each CCD
image frame was approximately 2 mm square. Exposure times of 2 s in
the Cy5 channel (white light through Chroma 618-648 nm excitation
filter, Chroma 657-727 nm emission filter) and 1 s in the Cy3 channel
(Chroma 535-560 nm excitation filter, Chroma 570-620 nm emission
filter) were done consecutively in each frame before moving to the next,
spatially contiguous frame. Color isolation between the Cy3 and Cy5
channels was about 100:1 or better. Frames were 'knitted' together in
software to make the complete images. The intensities of spots (about 100
µm) were quantified from the 10-µm pixels by frame-by-frame background
subtraction and intensity averaging in each channel. The dynamic
range of the resulting spot intensities was typically a ratio of 1,000
between the brightest spots and the background-subtracted additive error
level. Normalization between the channels was accomplished by
normalizing each channel to the mean intensities of all genes. This procedure is
nearly equivalent to normalization between channels using the intensity
ratio of genomic DNA spots, but is possibly more robust, as it is based on
the intensities of several thousand spots distributed over the array.
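
The channel normalization described above can be illustrated with a minimal Python sketch. It assumes background-subtracted, strictly positive spot intensities held in two equal-length lists; the function name `normalize_channels` and the sample values are hypothetical and are not taken from the authors' software.

```python
def normalize_channels(cy3, cy5):
    """Scale each channel by its mean over all spots, then return
    the per-spot Cy5/Cy3 ratio of the normalized intensities.

    Assumes background-subtracted, strictly positive intensities.
    """
    if not cy3 or len(cy3) != len(cy5):
        raise ValueError("channels must be non-empty and of equal length")
    mean3 = sum(cy3) / len(cy3)
    mean5 = sum(cy5) / len(cy5)
    return [(red / mean5) / (green / mean3) for green, red in zip(cy3, cy5)]


# Hypothetical example: the Cy5 channel is uniformly twice as bright,
# so every normalized ratio comes out at unity.
ratios = normalize_channels([100.0, 200.0, 300.0], [200.0, 400.0, 600.0])
```

A uniform brightness difference between the two dyes is removed entirely by this scaling, which is the point of normalizing to the mean over several thousand spots rather than to a handful of control spots.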

Signature correlation coefficients and their confidence limits. 
Correlation coefficients between the signature ORFs of various experi- 
ments were calculated using: 

$$\rho = \frac{\sum_k x_k y_k}{\left(\sum_k x_k^2 \, \sum_k y_k^2\right)^{1/2}}$$

where $x_k$ is the $\log_{10}$ of the expression ratio for the kth gene in the x
signature, and $y_k$ is the $\log_{10}$ of the expression ratio for the kth gene in the y
signature. The summation is over those genes that were either up- or
down-regulated in either experiment at the 95% confidence level. These 
genes each had a less than 5% chance of being actually unregulated (hav- 
ing expression ratios departing from unity due to measurement errors 
alone). This confidence level was assigned based on an error model which 
assigns a lognormal probability distribution to each gene's expression
ratio with characteristic width based on the observed scatter in its re- 
peated measurements (repeated arrays at the same nominal experimental 
conditions) and on the individual array hybridization quality. This latter 
dependence was derived from control experiments in which both Cy3 
and Cy5 samples were derived from the same RNA sample. For large 
numbers of repeated measurements the error reduces to the observed 
scatter. For a single measurement the error is based on the array quality 
and the spot intensity. 

Random measurement errors in the x and y signatures tend to bias the
correlation towards zero. In most experiments, most genes are not
significantly affected but do show small random measurement errors. Selecting
only the '95% confidence' genes for the correlation calculation, rather
than the entire genome, reduces this bias and makes the actual biological
correlations more apparent.
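
As a minimal sketch of the signature correlation defined above, the following Python function computes the coefficient from log10 expression ratios, summing only over a supplied list of signature genes. The dictionary layout, the gene names and the function name `signature_correlation` are assumptions made for illustration; the error model that selects the '95% confidence' genes is not reproduced here.

```python
import math


def signature_correlation(ratios_x, ratios_y, signature_genes):
    """Correlation of two signatures over the selected genes.

    ratios_x, ratios_y: dicts mapping gene name -> expression ratio.
    signature_genes: genes judged significantly regulated in either
    experiment (the selection itself is assumed to be done elsewhere).
    Note that the log ratios are not mean-centered, as in the formula.
    """
    xs = [math.log10(ratios_x[g]) for g in signature_genes]
    ys = [math.log10(ratios_y[g]) for g in signature_genes]
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = math.sqrt(sum(x * x for x in xs) * sum(y * y for y in ys))
    if denominator == 0:
        raise ValueError("at least one signature has no regulated genes")
    return numerator / denominator


# Hypothetical data: gene C is unregulated and is left out of the sum.
treated = {"A": 10.0, "B": 0.1, "C": 1.0}
mutant = {"A": 0.1, "B": 10.0, "C": 1.0}
same = signature_correlation(treated, treated, ["A", "B"])
opposite = signature_correlation(treated, mutant, ["A", "B"])
```

A signature compared with itself gives a coefficient of unity, and a signature with every regulated gene reversed gives minus one, consistent with the definition.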

Correlations between a profile and itself are unity by definition. Error
limits on the correlation are 95% confidence limits based on the individual
measurement error bars, and assuming uncorrelated errors. They do
not include the bias mentioned above; thus, a departure of ρ from unity
does not necessarily mean that the underlying biological correlation is
imperfect. However, a correlation of 0.7 ± 0.1, for example, is very
significantly different from zero. Small (magnitude of ρ < 0.2) but formally
significant correlations in the tables and text probably are due to small
systematic biases in the Cy5/Cy3 ratios that violate the assumption of
independent measurement errors used to generate the 95% confidence
limits. Therefore, these small correlation values should be treated as not
significant. A likely source of uncorrected systematic bias is the partially
corrected scanner detector nonlinearity that differently affects the Cy3
and Cy5 detection channels.

The 1 µg/ml FK506 treatment signature was compared with more
than 40 unrelated deletion mutant strain or drug signatures. These control
profiles had correlation coefficients with the FK506 profile that were
distributed around zero (mean ρ = -0.03) with a standard deviation of
0.16 (data not shown), and none had correlations greater than ρ = 0.38.
Similarly, the calcineurin mutant strain signature correlated well with the
CsA treatment signature (ρ = 0.71 ± 0.04) but not with the signatures
from the negative controls (mean ρ = -0.02 with a standard deviation of
0.18).

Quality controls. End-to-end checks on expression ratio measurement 
accuracy were provided by analyzing the variance in repeated hybridiza- 
tions using the same mRNA labeled with both Cy3 and Cy5, and also 
using Cy3 and Cy5 mRNA samples isolated from independent cultures of 
the same nominal strain and conditions. Biases undetected with this pro- 
cedure, such as gene-specific biases presumably due to differential incor- 
poration of Cy3- and Cy5-dUTP into cDNA, were minimized by doing 
hybridizations in fluor-reversed pairs, in which the Cy3/Cy5 labeling of
the biological conditions was reversed in one experiment with respect to 
the other. The expression ratio for each gene is then the ratio of ratios be- 
tween the two experiments in the pair. Other biases are removed by algo- 
rithmic numerical de-trending. The magnitude of these biases in the 
absence of de-trending and fluor reversal is typically about 30% in the 
ratio, but may be as high as twofold for some ORFs. 
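
The cancellation of gene-specific dye bias in a fluor-reversed pair can be sketched in a few lines of Python. The sketch assumes a simple multiplicative bias on the Cy5/Cy3 ratio; the function name `combine_fluor_reversed` and the numbers are hypothetical. Taking the square root of the ratio of ratios, so that the result is on the scale of a single expression ratio, is an assumption of this sketch and is not stated in the text.

```python
import math


def combine_fluor_reversed(forward_ratio, reversed_ratio):
    """Combine the Cy5/Cy3 ratios of one gene from a fluor-reversed pair.

    forward_ratio: Cy5/Cy3 with the treated sample labeled with Cy5.
    reversed_ratio: Cy5/Cy3 with the treated sample labeled with Cy3.
    If a dye bias b multiplies both measurements, forward = b * R and
    reversed = b / R, so forward / reversed = R ** 2 and b cancels.
    """
    if forward_ratio <= 0 or reversed_ratio <= 0:
        raise ValueError("ratios must be positive")
    return math.sqrt(forward_ratio / reversed_ratio)


# Hypothetical gene: true induction R = 4 and a 30% dye bias (b = 1.3).
estimate = combine_fluor_reversed(1.3 * 4.0, 1.3 / 4.0)
```

Under this model the 30% bias drops out and the estimate returns to the true fourfold induction.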

Expression ratios are based on mean intensities over each spot. Some
smaller spots have fewer image pixels in the average. This does not
degrade accuracy noticeably until the number of pixels falls below ten, in
which case the spot is rejected from the data set. 'Wander' of spot posi- 
tions with respect to the nominal grid is adaptively tracked in array sub- 
regions by the image processing software. Unequal spot 'wander' within 
a subregion greater than half-a-spot spacing is a difficulty for the auto- 
mated quantitating algorithms; in this case, the spot is rejected from 
analysis based on human inspection of the 'wander'. Any spots partially 
overlapping are excluded from the data set. Less than 1% of spots typi- 
cally are rejected for these reasons. 
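
The pixel-count rule above lends itself to a short Python sketch. Only the threshold of ten pixels comes from the text; the function name `spot_mean_intensity`, the constant `MIN_PIXELS` and the use of `None` to mark a rejected spot are assumptions made for illustration.

```python
MIN_PIXELS = 10  # spots averaged over fewer pixels are rejected


def spot_mean_intensity(pixels):
    """Return the mean intensity over a spot's pixels, or None when the
    spot has fewer than MIN_PIXELS pixels and is rejected."""
    if len(pixels) < MIN_PIXELS:
        return None
    return sum(pixels) / len(pixels)


# Hypothetical spots: one with 12 pixels (kept), one with 9 (rejected).
kept = spot_mean_intensity([50.0] * 12)
rejected = spot_mean_intensity([50.0] * 9)
```

Rejection for spot 'wander' or overlap depends on image geometry and human inspection and is not modeled here.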
The Transcriptional Program in
the Response of Human
Fibroblasts to Serum
The temporal program of gene expression during a model physiological response
of human cells, the response of fibroblasts to serum, was explored with
a complementary DNA microarray representing about 8600 different human 
genes. Genes could be clustered into groups on the basis of their temporal 
patterns of expression in this program. Many features of the transcriptional 
program appeared to be related to the physiology of wound repair, suggesting 
that fibroblasts play a larger and richer role in this complex multicellular 
response than had previously been appreciated. 



The response of mammalian fibroblasts to
serum has been used as a model for studying
growth control and cell cycle progression (1).
Normal human fibroblasts require growth
factors for proliferation in culture; these
growth factors are usually provided by fetal
bovine serum (FBS). In the absence of
growth factors, fibroblasts enter a nondividing
state, termed G0, characterized by low
metabolic activity. Addition of FBS or purified
growth factors induces proliferation of
the fibroblasts; the changes in gene expres- 
sion that accompany this proliferative re- 
sponse have been the subject of many studies, 
and the responses of dozens of genes to se- 
rum have been characterized. 

We took a fresh look at the response of 
human fibroblasts to serum, using cDNA mi- 
croarrays representing about 8600 distinct hu- 
man genes to observe the temporal program of 
transcription that underlies this response. Pri- 
mary cultured fibroblasts from human neonatal 
foreskin were induced to enter a quiescent state 
by serum deprivation for 48 hours and then 
stimulated by addition of medium containing 
10% FBS (2). DNA microarray hybridization
was used to measure the temporal changes in
mRNA levels of 8613 human genes (3) at 12
times, ranging from 15 min to 24 hours after 
serum stimulation. The cDNA made from pu- 
rified mRNA from each sample was labeled 
with the fluorescent dye Cy5 and mixed with a 
common reference probe consisting of cDNA 
made from purified mRNA from the quiescent 

Fig. 1. The same section of the microarray is shown for three independent
hybridizations comparing RNA isolated at the 8-hour time point after serum
treatment to RNA from serum-deprived cells. Each microarray contained 9996
elements, including 9804 human cDNAs, representing 8613 different genes.
mRNA from serum-deprived cells was used to prepare cDNA labeled with
Cy3-deoxyuridine triphosphate (dUTP), and mRNA harvested from cells at different times after serum
stimulation was used to prepare cDNA labeled with Cy5-dUTP. The two cDNA probes were mixed and
simultaneously hybridized to the microarray. The image of the subsequent scan shows genes whose
mRNAs are more abundant in the serum-deprived fibroblasts (that is, suppressed by serum treatment)
as green spots and genes whose mRNAs are more abundant in the serum-treated fibroblasts as red
spots. Yellow spots represent genes whose expression does not vary substantially between the two
samples. The arrows indicate the spots representing the following genes: 1, protein disulfide
isomerase-related protein P5; 2, IL-8 precursor; 3, EST AA057170; and 4, vascular endothelial growth factor.

culture (time zero) labeled with a second fluorescent
dye, Cy3 (4). The color images of the
hybridization results (Fig. 1) were made by 
representing the Cy3 fluorescent image as 
green and the Cy5 fluorescent image as red and 
merging the two color images. 
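
The construction of the merged color image can be sketched in Python as follows. Only the channel assignment (Cy3 as green, Cy5 as red) comes from the text; the linear scaling of each channel to its own maximum, the function name `merge_pseudocolor` and the toy intensities are assumptions made for illustration.

```python
def merge_pseudocolor(cy3, cy5):
    """Merge two equally sized intensity images (lists of rows) into rows
    of (red, green, blue) tuples: Cy5 drives red, Cy3 drives green.

    Each channel is scaled linearly so that its brightest pixel maps to 255.
    """
    peak3 = max(max(row) for row in cy3) or 1  # avoid dividing by zero
    peak5 = max(max(row) for row in cy5) or 1
    image = []
    for row3, row5 in zip(cy3, cy5):
        image.append([
            (round(255 * red / peak5), round(255 * green / peak3), 0)
            for green, red in zip(row3, row5)
        ])
    return image


# Hypothetical 2 x 2 image: an empty spot, a spot bright in both channels,
# a Cy3-only spot and a Cy5-only spot.
cy3 = [[0, 100], [100, 0]]
cy5 = [[0, 100], [0, 100]]
merged = merge_pseudocolor(cy3, cy5)
```

A spot of equal brightness in both channels receives equal red and green values and so appears yellow, which is why yellow marks genes whose expression does not differ substantially between the two samples.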

Diverse temporal profiles of gene expression
could be seen among the 8613 genes surveyed
in this experiment (Fig. 2); many of these
genes (about half) were unnamed expressed
sequence tags (ESTs) (5). Although diverse
patterns of expression were observed, the order- 
ly choreography of the expression program be- 
came apparent when the results were analyzed 
by a clustering and display method developed 
in our laboratory for analyzing genome-wide 



gene expression data (6). An example of such 
an analysis, here applied to a subset of 517 
genes whose expression changed substantially 
in response to serum (7), is shown in Fig. 2. 
The entire detailed data set underlying Fig. 
2 is available as a tab-delimited table (in 
cluster order) at the Science Web site (www. 
sciencemag.org/feature/data/984559.shl). In 
addition, the entire, larger data set for the 
complete set of genes analyzed in this exper- 
iment can be found at a Web site maintained 
by our laboratory (genome-www.stanford. 
edu/serum) (8). 
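
To illustrate how genes can be grouped by the similarity of their temporal profiles, here is a minimal Python sketch of agglomerative clustering. It assumes Pearson correlation as the similarity measure and average linkage between groups, which may differ in detail from the published method of reference 6; the function names, gene names and log-ratio values are all hypothetical, and profiles are assumed not to be constant over time.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length, non-constant profiles."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def cluster_profiles(profiles, n_clusters):
    """Repeatedly merge the two most similar groups (average pairwise
    correlation) until n_clusters groups of gene names remain."""
    clusters = [[name] for name in sorted(profiles)]

    def similarity(c1, c2):
        pairs = [pearson(profiles[a], profiles[b]) for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > n_clusters:
        _, i, j = max(
            (similarity(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical log-ratio time courses: two transient and two late genes.
profiles = {
    "early1": [0.0, 2.0, 1.0, 0.0],
    "early2": [0.0, 1.8, 1.1, 0.1],
    "late1": [0.0, 0.0, 1.0, 2.0],
    "late2": [0.1, 0.0, 0.9, 2.1],
}
groups = cluster_profiles(profiles, 2)
```

With these toy profiles the transiently induced genes fall into one group and the late-induced genes into the other. This brute-force loop is meant only to show the idea; it recomputes every pairwise similarity at each step and would be slow for hundreds of genes.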

One measure of the reliability of the 
changes we observed is inherent in the ex- 
pression profiles of the genes. For most genes 
whose expression levels changed, we could 
see a gradual change over a few time points, 
which thus effectively provided independent 
measurements for almost all of the observa- 
tions. An additional check was provided by 
the inclusion of duplicate and, in a few cases, 
multiple array elements representing the 
same gene for about 5% of the genes included 
in this microarray. In addition, three indepen- 
dent hybridizations to different microarrays 
with mRNA samples from cells harvested 8 
hours after serum addition showed good cor- 
relation (Fig. 1). As an independent test, we 
measured the expression levels of several 
genes using the TaqMan 5' nuclease fluori- 
genic quantitative polymerase chain reaction 
(PCR) assay (9). The expression profiles of 
the genes, as measured by these two independent
methods, were very similar (Fig. 3) (10).

The transcriptional response of fibroblasts 
to serum was extremely rapid. The immediate 
response to serum stimulation was dominated 
by genes that encode transcription factors 
and other proteins involved in signal transduction.
The mRNAs for several genes [including
c-FOS, JUN B, and mitogen-activated
protein (MAP) kinase phosphatase-1
(MKP1)] were detectably induced within
15 min after serum stimulation (Fig. 4, A
and B). Fifteen of the genes that were
observed to be induced by serum encode
known or suspected regulators of transcription
(Fig. 4B). All but one were immediate-early
genes: their induction was not inhibited
by cycloheximide (11). This class of
genes could be divided into those
whose induction was transient (Fig. 2, cluster E)
and those whose mRNA levels remained
induced for much longer (Fig. 2,
clusters I and J). Some features of the
immediate response appeared to be directed
at adaptation to the initiating signals. We
observed a marked induction of mRNA
encoding MKP1, a dual-specificity phosphatase
that modulates the activity of the
ERK1 and ERK2 MAP kinases (12). The
coincidence of the peak of expression of
genes in cluster E (Fig. 2) with that of
MKP1 (Fig. 4A) suggests the possibility



Fig. 2. Cluster image showing the different classes of gene expression
profiles. Five hundred seventeen genes whose mRNA levels changed in response to
serum stimulation were selected (7). This subset of genes was clustered
hierarchically into groups on the basis of the similarity of their expression
profiles by the procedure of Eisen et al. (6). The expression pattern of each
gene in this set is displayed here as a horizontal strip. For each gene, the
ratio of mRNA levels in fibroblasts at the indicated time after serum
stimulation ("unsync" denotes exponentially growing cells) to its level in the
serum-deprived (time zero) fibroblasts is represented by a color, according to
the color scale at the bottom. The graphs show the average expression profiles
for the genes in the corresponding "cluster" (indicated by the letters A to J
and color coding). In every case examined, when a gene was represented by more
than one array element, the multiple representations in this set were seen to
have identical or very similar expression profiles, and the profiles
corresponding to these independent measurements clustered either adjacent or
very close to each other, pointing to the robustness of the clustering
algorithm in grouping genes with very similar patterns of expression.

that continued activity of the MAP kinase pathway
is required to maintain induction of these
genes but not of those with sustained expression
(clusters I and J). The gene encoding a second
member of the dual-specificity MAP kinase
phosphatase family, known as dual-specificity
protein phosphatase 6/pyst2, was induced later,
at about 4 hours after serum stimulation. Genes
encoding diverse other proteins with roles in
signal transduction, ranging from cell-surface
receptors [for example, the sphingosine 1-phosphate
receptor (EDG-1), the vascular endothelial
growth factor receptor, and the type II
BMP receptor] to regulators of G-protein signaling
(for example, NET1/p115 rho GEF) to
DNA-binding transcription factors, were induced
by serum (Fig. 4A).

The reprogramming of the regulatory circuits
in response to serum involved not only
induction of transcription factors but also reduced
expression of many transcriptional regulators,
some of which may play roles in
maintaining the cells in G0 or in priming
them to react to wounding (Fig. 4C). Perhaps
as a consequence of the historical focus on
genes induced by serum stimulation of fibroblasts,
the set of transcription factors whose
expression diminished upon serum stimulation
has been less well characterized.

Genes known or likely to be involved in
controlling and mediating the proliferative response
showed distinctive patterns of regulation.
Several genes whose products inhibit progression
of the cell-division cycle, such as p27
Kip1, p57 Kip2, and p18, were expressed in the
quiescent fibroblasts and down-regulated before
the onset of cell division. The nadir in the
mRNA levels for these genes occurred between
6 and 12 hours after serum stimulation (Fig.
5A), coincident with the passage of the fibroblasts
through G1. The levels of the transcript
encoding the WEE1-like protein kinase, which
is believed to inhibit mitosis by phosphorylation
of Cdc2, diminished between 4 and 8 to
12 hours after serum addition (Fig. 5A), well
before the onset of M phase at around 16 hours,
raising the possibility of an additional role for
Wee1 in an earlier stage of the cell cycle or in
regulating the G0 to G1 transition. Several
genes induced in the first few hours after serum
stimulation, such as the helix-loop-helix proteins
ID2 and ID3 and EST AA016305, a gene
with homology to G1-S cyclins, are candidates
for roles in promoting the exit from G0.

Genes involved in mediating progression
through the cell cycle were characterized by a
distinctive pattern of expression (Fig. 2, cluster D),
reflecting the coincidence of their
expression with the reentry of the stimulated
fibroblasts into the cell-division cycle. The
stimulated fibroblasts replicated their DNA
about 16 hours after serum treatment. This
timing was reflected by the induction of
mRNA encoding both subunits of ribonucleotide
reductase and PCNA, the processivity
factor for DNA polymerase epsilon and delta.
Cyclin A, Cyclin B1, Cdc2, and CDC28 kinase,
regulators of passage through the S
phase and the transition from G2 to M phase,
were induced at about 16 to 20 hours after
serum addition. The kinase in the Cyclin
B1-CDK pair needs to be activated by phosphorylation.
The gene encoding Cyclin-dependent
kinase 7 (CDK7; a homolog of Xenopus
MO15 cdk-activating kinase) was induced
in parallel with the Cdc2 and Cdc28
kinases (Fig. 5A), suggesting a potential role
for CDK7 in mediating M phase. DNA topoisomerase
II α, required for chromosome segregation
at mitosis; Mad2, a component of
the spindle checkpoint that prevents completion
of mitosis (anaphase) if chromosomes
are not attached to the spindle; and the kinetochore
protein CENP-F all showed a similar
expression profile.

In the hours after the serum stimulus, one of
the most striking features of the unfolding tran- 
scriptional program was the appearance of nu- 
merous genes with known roles in processes 
relevant to the physiology of wound healing. 



These included both genes involved in the di- 
rect role played by fibroblasts in remodeling of 
the clot and the extracellular matrix and, more 
notably, genes encoding proteins involved in 
intercellular signaling (Fig. 5). Genes induced 
in this program encode products that can (i) 
participate in the dynamic process of clotting, 
clot dissolution, and remodeling and perhaps 
contribute to hemostasis by promoting local 
vasoconstriction (for example, endothelin-1); 
(ii) promote chemotaxis and activation of neu- 
trophils (for example, COX2) and recruitment 
and extravasation of monocytes and macro- 
phages (for example, MCP1); (iii) promote 
chemotaxis and activation of T lymphocytes 
[for example, interleukin-8 (IL-8)] and B 
lymphocytes (for example, ICAM-1), thus 
providing both innate and antigen-specific 
defenses against wound infection and recruit- 
ing the phagocytic cells that will be required 
to clear out the debris during remodeling of 
the wound; (iv) promote angiogenesis and 
neovascularization (for example, VEGF) 
through newly forming tissue; (v) promote 
migration and proliferation of fibroblasts (for 
example, CTGF) and their differentiation into 
myofibroblasts (for example, Vimentin); and 
(vi) promote migration and proliferation of 
keratinocytes, leading to reepithelialization 
of the wound (for example, FGF7), and pro- 
mote proliferation of melanocytes, perhaps 
contributing to wound hyperpigmentation 
(for example, FGF2). 

Coordinated regulation of groups of genes 
whose products act at different steps in a 
common process was a recurring theme. For 
example, Furin, a prohormone-processing 
protease required for one of the processing 
steps in the generation of active endothelin, 
was induced in parallel with induction of the 
gene encoding the precursor of endothelin-1 
(Fig. 5E) (13). Conversely, expression of 
CALLA/CD10, a membrane metalloprotease
that degrades endothelin-1 and other peptide
mediators of acute inflammation, was reduced.

Fig. 3. Independent verification of microarray quantitation. Relative mRNA
levels of the indicated genes (Mast, mast/stem cell growth factor receptor)
were measured with the TaqMan 5' nuclease fluorigenic quantitative PCR
assay (9) (left) in the same samples that were used to prepare probes for
microarray hybridizations (right). Data from the TaqMan analysis were
normalized to mRNA concentrations and plotted relative to the level at
time zero, so that the results could be compared with those from the
microarray hybridizations. In general, quantitation with the two methods
gave very similar results (10).

A second example is provided by a set
of five genes involved in the biosynthesis of 
cholesterol (Fig. 5I). The mRNAs encoding
each of these enzymes showed sharply dimin- 
ished expression beginning 4 to 6 hours after 
serum stimulation of fibroblasts. A likely ex- 
planation for the coordinated down-regula- 
tion of the cholesterol biosynthetic pathway 
is that serum provides cholesterol to fibro- 
blasts through low-density lipoproteins, 
whereas in the absence of the cholesterol 
provided by serum, endogenous cholesterol 
biosynthesis in fibroblasts is required. 

Many of the previously studied genes that 
we observed to be regulated in this program 
have no recognized role in any aspect of wound 
healing or fibroblast proliferation. Their identi- 
fication in this study may therefore point to 
previously unknown aspects of these processes. 
A few selected genes in this group are shown in 
Fig. 5H. The stanniocalcin gene, for example 
(Fig. 5H), encodes a secreted protein without a 
clearly identified function in human cells (14,
15).

Fig. 4. "Reprogramming" of fibroblasts. Expression profiles of genes whose
function is likely to play a role in the reprogramming phase of the
response are shown with the same representation as in Fig. 2. In the cases
in which a gene was represented by more than one element in
the microarray, all measurements are shown.
The genes were grouped into categories on the
basis of our knowledge of their most likely role.
Some genes with pleiotropic roles were included
in more than one category.

Its induction in serum-stimulated fibroblasts
suggests the possibility that it may play a
role in the wound-healing process, perhaps 
serving as a signal in mediating inflammation 
or angiogenesis. 

One of the most important results of this 
exploration was the discovery of over 200 pre- 
viously unknown genes whose expression was 
regulated in specific temporal patterns during 
the response of fibroblasts to serum. For example,
13 of the 40 genes in cluster D (Fig. 2) have
descriptive names that reflect their putative
function. Nine of these 13 genes (69%) encode
proteins that play roles in cell cycle progression,
particularly in DNA replication and the
G2-M transition. This enrichment for cell
cycle-related genes suggests that some of the
unnamed genes in this cluster (for example,
EST W79311 and EST R13146, neither of
which has sequence similarity to previously
characterized genes) may represent previously
unknown genes involved in this part of the cell
cycle. Similarly, a remarkable fraction of genes 
that were grouped into cluster F on the basis of 
their expression profiles encoded proteins in- 
volved in intercellular signaling (Fig. 2), sug- 
gesting that a similar role should be considered 
for the many unnamed genes in this cluster. A 
disproportionately large fraction of the genes 
whose transcription diminished upon serum 
stimulation were unnamed ESTs. 

Our intention was to use this experiment as 
a model to study the control of the transition 

Fig. 5. The transcriptional response to serum suggests a multifaceted role for fibroblasts in the
physiology of wound healing. The features of the transcriptional program of fibroblasts in response
to serum stimulation that appear to be related to various aspects of the wound-healing process and
fibroblast proliferation are shown with the same convention for representing changes in transcript
levels as was used in Figs. 2 and 4. (A) Cell cycle and proliferation, (B) coagulation and hemostasis,
(C) inflammation, (D) angiogenesis, (E) tissue remodeling, (F) cytoskeletal reorganization, (G)
reepithelialization, (H) unidentified role in wound healing, and (I) cholesterol biosynthesis. The
numbers in (C) and (G) refer to genes whose products serve as signals to neutrophils (C1),
monocytes and macrophages (C2), T lymphocytes (C3), B lymphocytes (C4), and melanocytes (G1).

from G0 to a proliferating state. However, one
of the defining characteristics of genome-scale 
expression profiling experiments is that the ex- 
amination of so many diverse genes opens a 
window on all the processes that actually occur 
and not merely the single process one intended 
to observe. Serum, the soluble fraction of clot- 
ted blood, is normally encountered by cells in 
vivo in the context of a wound. Indeed, the 
expression program that we observed in re- 
sponse to serum suggests that fibroblasts are 
programmed to interpret the abrupt exposure to 
serum not as a general mitogenic stimulus but 
as a specific physiological signal, signifying a 
wound. The proliferative response that we originally
intended to study appeared to be part of a
larger physiological response of fibroblasts to a 
wound. Other features of the transcriptional 
response to serum suggest that the fibroblast is 
an active participant in a conversation among 
the diverse cells that work together in wound 
repair, interpreting, amplifying, modifying, and 
broadcasting signals controlling inflammation, 
angiogenesis, and epithelial regrowth during 
the response to an injury. 

We recognize that these in vitro results 
almost certainly represent a distorted and in- 
complete rendering of the normal physiolog- 
ical response of a fibroblast to a wound. 
Moreover, only the responses elicited directly 
by exposure of fibroblasts to serum were 
examined. The subsequent signals from other 
cellular participants in the normal wound- 
healing process would certainly provoke fur- 
ther evolution of the transcriptional program 
in fibroblasts at the site of a wound, which 
this experiment cannot reveal. Nevertheless, 
we believe that the picture that emerged 
strongly suggests a much larger and richer 
role for the fibroblast in the orchestration of 
this important physiological process than had 
previously been suspected. 
Systematic variation in gene expression
patterns in human cancer cell lines
We used cDNA microarrays to explore the variation in expression of approximately 8,000 unique genes among the
60 cell lines used in the National Cancer Institute's screen for anti-cancer drugs. Classification of the cell lines based 
solely on the observed patterns of gene expression revealed a correspondence to the ostensible origins of the 
tumours from which the cell lines were derived. The consistent relationship between the gene expression patterns 
and the tissue of origin allowed us to recognize outliers whose previous classification appeared incorrect. Specific 
features of the gene expression patterns appeared to be related to physiological properties of the cell lines, such 
as their doubling time in culture, drug metabolism or the interferon response. Comparison of gene expression pat- 
terns in the cell lines to those observed in normal breast tissue or in breast tumour specimens revealed features of 
the expression patterns in the tumours that had recognizable counterparts in specific cell lines, reflecting the 
tumour, stromal and inflammatory components of the tumour tissue. These results provided a novel molecular 
characterization of this important group of human cell lines and their relationships to tumours in vivo. 



Introduction 

Cell lines derived from human tumours have been extensively used
as experimental models of neoplastic disease. Although such cell 
lines differ from both normal and cancerous tissue, the inaccessi- 
bility of human tumours and normal tissue makes it likely that 
such cell lines will continue to be used as experimental models for 
the foreseeable future. The National Cancer Institute's Develop- 
mental Therapeutics Program (DTP) has carried out intensive 
studies of 60 cancer cell lines (the NCI 60) derived from tumours 
from a variety of tissues and organs 1-4 . The DTP has assessed many
molecular features of the cells related to cancer and chemothera- 
peutic sensitivity, and has measured the sensitivities of these 60 cell 
lines to more than 70,000 different chemical compounds, including
all common chemotherapeutics (http://dtp.nci.nih.gov). A
previous analysis of these data revealed a connection between the 
pattern of activity of a drug and its method of action. In particular, 
there was a tendency for groups of drugs with similar patterns of 
activity to have related methods of action 3,5-7 .

We used DNA microarrays to survey the variation in abun- 
dance of approximately 8,000 distinct human transcripts in these 
60 cell lines. Because of the logical connection between the func- 
tion of a gene and its pattern of expression, the correlation of gene 
expression patterns with the variation in the phenotype of the cell 
can begin the process by which the function of a gene can be 
inferred. Similarly, the patterns of expression of known genes can
reveal novel phenotypic aspects of the cells and tissues studied 8-10 .
Here we present an analysis of the observed patterns of gene 
expression and their relationship to phenotypic properties of the 
60 cell lines. The accompanying report 11 explores the relationship
between the gene expression patterns and the drug sensitivity pro- 
files measured by the DTP. The assessment of gene expression pat- 
terns in a multitude of cell and tissue types, such as the diverse set 
of cell lines we studied here, under diverse conditions in vitro and 
in vivo, should lead to increasingly detailed maps of the human 
gene expression program and provide clues as to the physiological 
roles of uncharacterized genes 11-16 . The databases, plus tools for
analysis and visualization of the data, are available (http://genome- 
www.stanford.edu/nci60 and http://discover.nci.nih.gov). 

Results 

We studied gene expression in the 60 cell lines using DNA 
microarrays prepared by robotically spotting 9,703 human 
cDNAs on glass microscope slides 17,18 . The cDNAs included 
approximately 8,000 different genes: approximately 3,700 repre- 
sented previously characterized human proteins, an additional 
1,900 had homologues in other organisms and the remaining 
2,400 were identified only by ESTs. Due to ambiguity of the iden- 
tity of the cDNA clones used in these studies, we estimated that 
approximately 80% of the genes in these experiments were cor- 
rectly identified. The identities of approximately 3,000 cDNAs 
Fig. 1 Gene expression patterns related to the tissue of origin of the cell lines. Two-dimensional
hierarchical clustering was applied to expression data from a set of 1,161 cDNAs
measured across 64 cell lines. The 1,161 cDNAs were those (of 9,703 total) with transcript
levels that varied by at least sevenfold (log2(ratio) > 2.8) relative to the reference pool in at
least 4 of 60 cell lines. This effectively selected genes with the greatest variation in expression
level across the 60 cell lines (including those genes not well represented in the reference
pool), and therefore highlighted those gene expression patterns that best
distinguished the cell lines from one another. Data from 64 hybridizations were used, one
for each cell line plus the two additional independent representations of each of the cell
lines K562 and MCF7. The two cell lines represented in triplicate were correspondingly
weighted for the gene clustering so that each of the 60 cell lines contributed equally to the
clustering. a, The cell-line dendrogram, with the terminal branches coloured to reflect the
ostensible tissue of origin of the cell line (red, leukaemia; green, colon; pink, breast; purple,
prostate; light blue, lung; orange, ovarian; yellow, renal; grey, CNS; brown, melanoma;
black, unknown (NCI/ADR-RES)). The scale to the right of the dendrogram depicts the correlation
coefficient represented by the length of the dendrogram branches connecting
pairs of nodes. Note that the two triplets of replicated cell lines (K562 and MCF7) cluster
tightly together and were well differentiated from even the most closely related cell lines,
indicating that this clustering of cell lines is based on characteristic variations in their gene
expression patterns rather than artefacts of the experimental procedures. b, A coloured
representation of the data table, with the rows (genes) and columns (cell lines) in cluster
order. The dendrogram representing hierarchical relationships between genes was omitted
for clarity, but is available (http://genome-www.stanford.edu/nci60). The colour in each
cell of this table reflects the mean-adjusted expression level of the gene (row) and cell line
(column). The colour scale used to represent the expression ratios is shown. The labels
'3a-3d' in (b) refer to the clusters of genes shown in detail in Fig. 3.
from these experiments have been sequence-verified, including
all of those referred to here by name.
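
The gene-selection rule given in the legend to Fig. 1 (variation of at least sevenfold relative to the reference pool in at least 4 of the 60 cell lines) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name, the data layout (one list of log2 ratios per cDNA, with None for missing values) and the use of the absolute log ratio to count both increases and decreases are assumptions.

```python
def select_variable_genes(log2_ratios, min_abs_log2=2.8, min_lines=4):
    """Return row indices of cDNAs that vary strongly in enough cell lines.

    log2_ratios: list of rows, one per cDNA; each row holds log2(Cy5/Cy3)
    values, one per cell line, with None where no measurement exists.
    min_abs_log2: threshold from the legend (2.8, roughly sevenfold).
    min_lines: minimum number of cell lines exceeding the threshold.
    """
    selected = []
    for index, row in enumerate(log2_ratios):
        # Assumption: a change in either direction counts as variation.
        hits = sum(1 for value in row
                   if value is not None and abs(value) > min_abs_log2)
        if hits >= min_lines:
            selected.append(index)
    return selected
```

Applied to the full table of 9,703 cDNAs, a rule of this kind yields the subset used for the clustering in Fig. 1; the exact count depends on details (handling of missing values, replicate hybridizations) that the sketch does not model.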

Each hybridization compared Cy5-labelled cDNA reverse transcribed
from mRNA isolated from one of the cell lines with Cy3-labelled
cDNA reverse transcribed from a reference mRNA
sample. This reference sample, used in all hybridizations, was 
prepared by combining an equal mixture of mRNA from 12 of 
the cell lines (chosen to maximize diversity in gene expression as 
determined primarily from two-dimensional gel studies 2 ). By 
comparing cDNA from each cell line with a common reference, 
variation in gene expression across the 60 cell lines could be 
inferred from the observed variation in the normalized Cy5/Cy3 
ratios across the hybridizations. 
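
The role of the common reference can be written out explicitly. The notation here is introduced for illustration only: $R_{g,A}$ is the normalized Cy5/Cy3 ratio of gene $g$ in cell line $A$, and $x_{g,A}$ and $x_{g,\mathrm{ref}}$ are the transcript abundances in the cell line and in the reference pool. Under the idealized assumption that each measured ratio is proportional to the corresponding abundance ratio, the reference term cancels when two cell lines are compared:

```latex
\log_2 R_{g,A} - \log_2 R_{g,B}
  = \log_2\frac{x_{g,A}}{x_{g,\mathrm{ref}}} - \log_2\frac{x_{g,B}}{x_{g,\mathrm{ref}}}
  = \log_2\frac{x_{g,A}}{x_{g,B}}
```

The cancellation fails for genes poorly represented in the reference pool, where the denominator is close to background and the ratio is unreliable.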

To assess the contribution of artefactual sources of variation in 
the experimentally measured expression patterns, K562 and 
MCF7 cell lines were each grown in three independent cultures, 
and the entire process was carried out independently on mRNA 
extracted from each culture. The variance in the triplicate fluo- 
rescence ratio measurements approached a minimum when the 
fluorescence signal was greater than approximately 0.4% of the 
measurable total signal dynamic range above background in 
either channel of the hybridization. We selected the subset of 
spots for which significant signal was present in both the numer- 
ator and denominator of the ratios by this criterion to identify 
the best-measured spots. The pair-wise correlation coefficients 
for the triplicates of the set of genes that passed this quality con- 
trol level (6,992 spots included for the MCF7 samples and 6,161 
spots for K562) ranged from 0.83 to 0.92 (for graphs and details, 
see http://genome-www.stanford.edu/nci60). 
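
The replicate check described above can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the authors' procedure: the function names, the representation of a hybridization as a pair of per-spot signal lists (Cy5, Cy3), and the choice to correlate log2 ratios rather than raw ratios are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def well_measured(cy5, cy3, dynamic_range, fraction=0.004):
    """Indices of spots whose net signal exceeds the cutoff in both channels.

    fraction=0.004 corresponds to the 0.4% of dynamic range quoted in the text.
    """
    cutoff = fraction * dynamic_range
    return [i for i, (a, b) in enumerate(zip(cy5, cy3))
            if a > cutoff and b > cutoff]


def replicate_correlation(rep_a, rep_b, dynamic_range):
    """Correlate log2(Cy5/Cy3) between two replicates over shared good spots.

    rep_a, rep_b: (cy5_signals, cy3_signals) tuples for the same spot order.
    """
    keep = sorted(set(well_measured(rep_a[0], rep_a[1], dynamic_range))
                  & set(well_measured(rep_b[0], rep_b[1], dynamic_range)))
    log_a = [math.log2(rep_a[0][i] / rep_a[1][i]) for i in keep]
    log_b = [math.log2(rep_b[0][i] / rep_b[1][i]) for i in keep]
    return pearson(log_a, log_b)
```

With three independent cultures per cell line, the function would be called on each of the three replicate pairs, giving the range of pair-wise coefficients reported in the text.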

To make the orderly features in the data more apparent, we used 
a hierarchical clustering algorithm 19,20 and a pseudo-colour visualization
matrix 3,21 . The object of the clustering was to group cell
lines with similar repertoires of expressed genes and to group 
genes whose expression level varied among the 60 cell lines in a 
similar manner. Clustering was performed twice using different 
subsets of genes to assess the robustness of the analysis. In one case 
(Fig. 1), we concentrated on those genes that showed the most 
variation in expression among the 60 cell lines (1,167 total). A sec- 
ond analysis (Fig. 2) included all spots that were thought to be well 
measured in the reference set (6,831 spots). 
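
To make the idea of hierarchical clustering concrete, the sketch below implements one common variant: agglomerative clustering with average linkage on Pearson correlation. The text does not specify the linkage rule or similarity measure beyond the cited references, so both are assumptions here, as are the function names; the replicate weighting mentioned in the legend to Fig. 1 is not modelled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (undefined for constant profiles)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cluster_similarity(members_a, members_b, sim):
    """Average pairwise similarity between the members of two clusters."""
    pairs = [(min(i, j), max(i, j)) for i in members_a for j in members_b]
    return sum(sim[p] for p in pairs) / len(pairs)


def average_linkage(profiles):
    """Repeatedly merge the two most similar clusters.

    profiles: list of expression profiles (lists of log ratios).
    Returns the merges in order as (members_a, members_b, similarity).
    """
    n = len(profiles)
    sim = {(i, j): pearson(profiles[i], profiles[j])
           for i in range(n) for j in range(i + 1, n)}
    clusters = {i: [i] for i in range(n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                s = cluster_similarity(clusters[a], clusters[b], sim)
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), s))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges
```

Running the same routine on the rows (genes) and on the columns (cell lines) of the data table gives the two-dimensional ordering used for display. The exhaustive pair search is only suitable for small examples; a table of several thousand genes needs a more efficient implementation.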

Gene expression patterns related to the histologic 
origins of the cell lines 

The most notable property of the clustered data was that cell lines 
with common presumptive tissues of origin grouped together 
(Figs 1a and 2). Cell lines derived from leukaemia, melanoma,
central nervous system, colon, renal and ovarian tissue were clus- 
tered into independent terminal branches specific to their respec- 
tive organ types with few exceptions. Cell lines derived from 
non-small-cell lung carcinoma and breast tumours were distributed
in multiple different terminal branches suggesting that their gene 
expression patterns were more heterogeneous. 

Many of these coherent cell line clusters were distinguished by 
the specific expression of characteristic groups of genes 
(Fig. 3a-d). For example, a cluster of approximately 90 genes was 
highly expressed in the melanoma-derived lines (Fig. 3c). This set 
was enriched for genes with known roles in melanocyte biology, 
including tyrosinase and dopachrome tautomerase (TYR and 
DCT; two subunits of an enzyme complex involved in melanin 
synthesis 22 ), MART1 (MLANA; which is being investigated as a
target for immunotherapy of melanoma 23 ) and S100-β (S100B;
which has been used as an antigenic marker in the diagnosis of 
Fig. 2 Gene expression patterns related to
other cell-line phenotypes. a, We applied
two-dimensional hierarchical clustering to 
expression data from a set of 6,831 cDNAs 
measured across the 64 cell lines. The 6,831 
cDNAs were those with a minimum fluores- 
cence signal intensity of approximately 0.4% 
of the dynamic range above background in 
the reference channel in each of the six 
hybridizations used to establish reproducibil- 
ity. This effectively selected those spots that 
provided the most reliable ratio measure- 
ments and therefore identified a subset of 
genes useful for exploring patterns comprised 
of those whose variation in expression across 
the 60 cell lines was of moderate magnitude. 
b, Cluster-ordered data table. c, Doubling
time of cell lines. Cell lines are given in cluster 
order. Values are plotted relative to the mean. 
Doubling times greater than the mean are 
shown in green, those with doubling time less 
than the mean are shown in red. d, Three
related gene clusters that were enriched for 
genes whose expression level variation was 
correlated with cell line proliferation rate. 
Each of the three gene clusters (clustered 
solely on the basis of their expression pat- 
terns) showed enrichment for sets of genes 
involved in distinct functional categories (for 
example, ribosomal genes versus genes
involved in pre-RNA splicing). e, Gene cluster
in which all characterized and sequence-verified
cDNAs encode genes known to be regulated
by interferons. f, Gene cluster enriched
for genes that have been implicated in drug 
metabolism (indicated by asterisks). A further 
property of the gene clustering evident here 
and in Fig. 2 is the strong tendency for redun- 
dant representations of the same gene to 
cluster immediately adjacent to one another, 
even within larger groups of genes with very 
similar expression patterns. In addition to 
illustrating the reproducibility and consis- 
tency of the measurements, and providing 
independent confirmation of many of our 
measurements, this property also demon- 
strates that these, and probably all, genes 
have nearly unique patterns of variation 
across the 60 cell lines. If this were not the 
case, and multiple genes had identical pat- 
terns of variation, we would not expect to be 
able to distinguish, by clustering on the basis 
of expression variation, duplicate copies of 
individual genes from the other genes with 
identical expression patterns. 
melanoma). LOX-IMVI, the seventh line designated as melanoma
in the NCI60, did not show this characteristic pattern. Although
isolated from a patient with melanoma, LOX-IMVI has previously
been noted to lack melanin and other markers useful for identifi- 
cation of melanoma cells 1 . 

Paradoxically, two related cell lines (MDA-MB435 and MDA-N),
which were derived from a single patient with breast cancer
and have been conventionally regarded as breast cancer cell lines,
shared expression of the genes associated with melanoma. MDA-MB435
was isolated from a pleural effusion in a patient with
metastatic ductal adenocarcinoma of the breast 24,25 . It remains 
possible that the origin of the cell line was a breast cancer, and that 
its gene expression pattern is related to the neuroendocrine fea- 
tures of some breast cancers 26 . But our results suggest that this cell 
line may have originated from a melanoma, raising the possibility 
that the patient had a co-existing occult melanoma. 

The higher-level organization of the cell-line tree — in which 
groups span cell lines from different tissue types — also reflected 
shared biological properties of the tissues from which the cell 
lines were derived. The carcinoma-derived cell lines were divided
into major branches that separated those that expressed genes 
characteristic of epithelial cells from those that expressed genes 
more typical of stromal cells. A cluster of genes is shown (Fig. 3b) 
that is most strongly expressed in cell lines derived from colon 
carcinomas, six of seven ovarian-derived cell lines and the two 
breast cancer lines positive for the oestrogen receptor. The named 
genes in this cluster have been implicated in several aspects of 
epithelial cell biology 27 . The cluster was enriched for genes whose 
products are known to localize to the basolateral membrane of 
epithelial cells, including those encoding components of 
adherens complexes (for example, desmoplakin (DSP), 
periplakin (PPL) and plakoglobin (JUP)), an epithelial- 
expressed cell-cell adhesion molecule (M4S1) and a sodium/ 
hydrogen ion exchanger 28-31 (SLC9A1). It also contained genes
that encode putative transcriptional regulators of epithelial mor- 
phogenesis, a human homologue of a Drosophila melanogaster 
epithelial-expressed tumour suppressor (LLGL1) and a homeo- 
box gene thought to control calcium-mediated adherence in 
epithelial cells 32,33 (MSX2).

In contrast, a separate, major branch of the cell-line dendrogram
(Fig. 1a) included all glioblastoma-derived cell lines, all
renal-cell-carcinoma-derived cell lines and the remaining
carcinoma-derived lines. The characteristic set of genes expressed in
this cluster included many whose products are involved in stro- 
mal cell functions (Fig. 3d). Indeed, the two cell lines originally 
described as 'sarcoma-like' in appearance (Hs578T, breast carci- 
nosarcoma, and SF539, gliosarcoma) expressed most of these 
genes 34,35 . Although no single gene was uniformly characteristic
of this cluster, each cell line showed a distinctive pattern of 
expression of genes encoding proteins with roles in synthesis or 
modification of the extracellular matrix (for example, caldesmon 
(CALD1), cathepsins, thrombospondin (THBS), lysyl oxidase 
(LOX) and collagen subtypes). Although the ovarian and most 
non-small-cell-lung-derived carcinomas expressed genes characteristic
of both epithelial cells and stromal cells, they probably
clustered with the CNS and renal cell carcinomas in this analysis 
because genes characteristically expressed in stromal cells were 
more abundantly represented in this gene set. 

Physiological variation reflected 
in gene expression patterns 

A cluster diagram of 6,831 genes (Fig. 2) is useful for exploring 
clusters of genes whose variation in mRNA levels was not obvi- 
ously attributable to cell or tissue type. We identified some gene 
clusters that were enriched for genes involved in specific cellular
processes; the variation in their expression levels may reflect corresponding
differences in activity of these processes in the cell
lines. For example, a cluster of 1,159 genes (Fig. 2a) included 
many whose products are necessary for progression through the 
cell cycle (such as CCNA1, MCM106 and MAD2L1), RNA pro- 
cessing and translation machinery (such as RNA helicases, 
hnRNPs and translation elongation factors) and traditional 
pathologic markers used to identify proliferating cells (MKI67).
Within this large cluster were smaller clusters enriched for genes 
with more specialized roles. One cluster was highly enriched for 
numerous ribosomal genes, whereas another was more enriched 
for genes encoding RNA-splicing factors. The variation in 
expression of these ribosomal genes was significantly correlated 
with variation in the cell doubling time (correlation coefficient of 
0.54), supporting the notion that the genes in this cluster were 
regulated in relation to cell proliferation rate or growth rate in 
these cell lines. 

In a smaller gene cluster (Fig. 2d), all of the named genes were 
previously known to be regulated by interferons 13,36 . Additional
groups of interferon-regulated genes showed distinct patterns of 
expression (data not shown), suggesting that the NCI60 cell lines
exhibited variation in activity of interferon-response pathways, 
which was reflected in gene expression patterns 36 . 

Another cluster (Fig. 2e) contained several genes encoding 
proteins with possible interrelated roles in drug metabolism, 
including glutamate-cysteine ligase (GLCLC, the enzyme respon- 
sible for the rate limiting step of glutathione synthesis), thiore- 
doxin (TXN) and thioredoxin reductase (TXNRD1; enzymes 
involved in regulating redox state in cells), and MRP1 (a drug 
transporter known to efficiently transport glutathione-conju- 
gated compounds 37 ). The elevated expression of this set of genes 
in a subset of these cell lines may reflect selection for resistance to 
chemotherapeutics. 

Cell lines facilitate interpretation of gene expression 
patterns in complex clinical samples 

Like many other types of cancer, tumours of the breast typically 
have a complex histological organization, with connective tissue 
and leukocytic infiltrates interwoven with tumour cells. To 
explore the possibility that variation in gene expression in the 
tumour cell lines might provide a framework for interpreting the 
expression patterns in tumour specimens, we compared RNA 
isolated from two breast cancer biopsy samples, a sample of nor- 
mal breast tissue and the NCI 60 cell lines derived from breast 
cancers (excluding MDA-MB-435 and MDA-N) and leukaemias 
(Fig. 4). This clustering highlighted features of the gene expres- 
sion pattern shared between the cancer specimens and individual 
cell lines derived from breast cancers and leukaemias. 

The genes encoding keratin 8 (KRT8) and keratin 19 (KRT19), 
as well as most of the other 'epithelial' genes defined in the complete
NCI 60 cell line cluster, were expressed in both of the biopsy
samples and the two breast-derived cell lines, MCF-7 and T47D, 
expressing the oestrogen receptor, suggesting that these tran- 
scripts originated in tumour cells with features similar to those of 
luminal epithelial cells (Fig. 5a). Expression of a set of genes characteristic
of stromal cells, including collagen genes (COL3A1,
COL5A1 and COL6A1) and smooth muscle cell markers
(TAGLN), was a feature shared by the tumour sample and the
stromal-like cell lines Hs578T and BT549 (Fig. 5b). This feature 
of the expression pattern seen in the tumour samples is likely to 
be due to the stromal component of the tumour. The tumours 
also shared expression of a set of genes (Fig. 5c) with the multiple 
myeloma cell line (RPMI-8226), notably including
immunoglobulin genes, consistent with the presence of B cells 
in the tumour (this was confirmed by staining with
anti-immunoglobulin antibodies; data not shown). Therefore, distinct
sets of genes with co-varying expression among the samples
(Fig. 4, arrow) appear to represent distinct cell types that can be
distinguished in breast cancer tissue. A fourth cluster of genes,
more highly expressed in all of the cell lines than in any of the
clinical specimens, was enriched for genes present in the 'proliferation'
cluster described above (Fig. 5d). The variation in
expression of these genes likely paralleled the difference in proliferation
rate between the rapidly cycling cultured cell lines and the
much more slowly dividing cells in tissues. 

Discussion 

Newly available genomics tools allowed us to explore variation in 
gene expression on a genomic scale in 60 cell lines derived from 
diverse tumour tissues. We used a simple cluster analysis to iden- 
tify the prominent features in the gene expression patterns that 
appeared to reflect 'molecular signatures' of the tissue from 
which the cells originated. The histological characteristics of the 
cell lines that dominated the clustering were pervasive enough 
that similar relationships were revealed when alternative subsets 
of genes were selected for analysis. Additional features of the 
expression pattern may be related to variation in physiological 
attributes such as proliferation rate and activity of interferon- 
response pathways. 

The properties of the tumour-derived cell lines in this study 
have presumably all been shaped by selection for resistance to 
host defences and chemotherapeutics and for rapid proliferation 
in the tissue culture environment of synthetic growth media, fetal 
bovine serum and a polystyrene substratum. But the primary 
identifiable factor accounting for variation in gene expression 
patterns among these 60 cell lines was the identity of the tissue 
from which each cell line was ostensibly derived. For most of the 
cell lines we examined, neither physiological nor experimental 
adaptation for growth in culture was sufficient to overwrite the 
gene expression programs established during differentiation in 
vivo. Nevertheless, the prominence of mesenchymal features in 
the cell lines isolated from glioblastomas and carcinomas may 
reflect a selection for the relative ease of establishment of cell 
lines expressing stromal characteristics, perhaps combined with 
physiological adaptation to tissue culture conditions 38-40 .



Fig. 4 Comparison of the gene expression patterns in clinical breast cancer 
specimens and cultured breast cancer and leukaemia cell lines. a, Two-dimensional
hierarchical clustering applied to gene expression data for two breast
cancer specimens, a lymph node metastasis from one patient, normal breast 
and the NCI60 breast and leukaemia-derived cell lines. The gene expression 
data from tissue specimens was clustered along with expression data from a 
subset of the NCI60 cell lines to explore whether features of expression pat- 
terns observed in specific lines could be identified in the tissue samples. Labels 
indicate gene clusters (shown in detail in Fig. 5) that may be related to specific 
cellular components of the tumour specimens. b, Breast cancer specimen 16
stained with anti-keratin antibodies, showing the complex mix of cell types 
characteristically found in breast tumours. The arrows highlight the different 
cellular components of this tissue specimen that were distinguished by the 
gene expression cluster analysis (Fig. 5). 



Biological themes linking genes with related expression pat- 
terns may be inferred in many cases from the shared attributes of 
known genes within the clusters. Uncharacterized cDNAs are 
likely to encode proteins that have roles similar to those of the 
known gene products with which they appear to be co-regulated. 
Still, for several clusters of genes, we were unable to discern a com- 
mon theme linking the identified members of the cluster. Further 
exploration of their variation in expression under more diverse 
conditions and more comprehensive investigation of the physiol- 
ogy of the NCI60 cells may provide insight 10 . The relationship of 
the gene expression patterns to the drug sensitivity patterns mea- 
sured by the DTP is an example of linking variation in gene 
expression with more subtle and diverse phenotypic variation 11 . 

The patterns of gene expression measured in the NCI 60 cell 
lines provide a framework that helps to distinguish the cells that 
express specific sets of genes in the histologically complex breast 
cancer specimens 41 . Although it is now feasible to analyse gene 
expression in micro-dissected tumour specimens 42,43 , this observation
suggests that it will be possible to explore and interpret
some of the biology of clinical tumour samples by sampling them 
intact. As is useful in conventional morphological pathology, one 
might be able to observe interactions between a tumour and its 
microenvironment in this way. These relationships will be clari- 
fied by suitable analysis of gene expression patterns from intact as 
well as dissected tumours 12,14,15,41 .

Methods 

cDNA clones. We obtained the 9,703 human cDNA clones (Research Genetics)
used in these experiments as bacterial colonies in 96-well microtitre
plates 9 . Approximately 8,000 distinct Unigene clusters (representing nomi- 
nally unique genes) were represented in this set of clones. All genes identi- 
fied here by name represent clones whose identities were confirmed by re- 
sequencing, or by the criteria that two or more independent cDNA clones 
ostensibly representing the same gene had nearly identical gene expression 
patterns. A single-pass 3' sequence re-verification was attempted for every
clone after re-streaking for single colonies. For a subset of genes for which
quality 3' sequence was not obtained, we attempted to confirm identities by
5' sequencing. Of the subset of clones selected for 5' sequence verification
on the basis of an interesting pattern of expression (888 total), 331 were correctly
identified, 57 incorrectly identified and 500 indeterminate (poor
quality sequence). We estimated that 15%-20% of array elements contained
DNA representing more than one clone per well. So far, the identities of
~3,000 clones have been verified. The full list of clones used and their nominal
identities are available (gene names preceded by the designation "SID#"
(Stanford Identification) represent clones whose identities have not yet been
verified; http://genome-www.stanford.edu:8000/nci60).

Production of cDNA microarrays. The arrays used in this experiment were 
produced at Synteni Inc. (now Incyte Pharmaceuticals). Each insert was 
amplified from a bacterial colony by sampling 1 µl of bacterial media and
performing PCR amplification of the insert using consensus primers for
the three plasmids represented in the clone set (5'-TTCTAAAACGACGCCCACTC-3',
5'-CACACAGGAAACAGCTATG-3'). Each PCR product
(100 µl) was purified by gel exclusion, concentrated and resuspended in
3xSSC (10 µl). The PCR products were then printed on treated glass
microscope slides using a robot with four printing tips. Detailed protocols
for assembling and operating a microarray printer, and printing and experimental
application of DNA microarrays are available
(http://cmgm.stanford.edu/pbrown).

Preparation of mRNA and reference pool. Cell lines were grown from NCI 
DTP frozen stocks in RPMI-1640 supplemented with phenol red, glutamine
(2 mM) and 5% fetal calf serum. To minimize the contribution of variations
in culture conditions or cell density to differential gene expression, we grew 
each cell line to 80% confluence and isolated mRNA 24 h after transfer to 
fresh medium. The time between removal from the incubator and lysis of the 
cells in RNA stabilization buffer was minimized (<1 min). Cells were lysed in 
buffer containing guanidinium isothiocyanate and total RNA was purified
with the RNeasy purification kit (Qiagen). We purified mRNA as needed
using a poly(A) purification kit (Oligotex, Qiagen) according to the manufacturer's
instructions. We assessed the integrity of the mRNA and its relative
contamination with ribosomal RNA by denaturing agarose gel electrophoresis.

The breast tumours were surgically excised from patients and rapidly 
transported to the pathology laboratory, where samples for microarray 
analysis were quickly frozen in liquid nitrogen and stored at -80 °C until 
use. A frozen tumour specimen was removed from the freezer, cut into 
small pieces (~50-100 mg each), immediately placed into 10-12 ml of Trizol
reagent (Gibco-BRL) and homogenized using a PowerGen 125 Tissue
Homogenizer (Fisher Scientific), starting at 5,000 r.p.m. and gradually
increasing to ~20,000 r.p.m. over a period of 30-60 s. We processed the
Trizol/tumour homogenate as described in the Trizol protocol, including an
initial step to remove fat. Once total RNA was obtained, we isolated mRNA 
with a FastTrack 2.0 kit (Invitrogen) using the manufacturer's protocol for 
isolating mRNA starting from total RNA. The normal breast samples were 
obtained from Clontech. 
We combined mRNA from the following cells in equal quantities to
make the reference pool: HL-60 (acute myeloid leukaemia) and K562
(chronic myeloid leukaemia); NCI-H226 (non-small-cell lung); COLO
205 (colon); SNB-19 (central nervous system); LOX-IMVI (melanoma);
OVCAR-3 and OVCAR-4 (ovarian); CAKI-1 (renal); PC-3 (prostate); and
MCF7 and Hs578T (breast). The criteria for selection of the cell lines in
the reference are described in detail in the accompanying manuscript 12 .

Doubling-time calculations. We calculated doubling times based on routine
NCI60 cell line compound screening data; they reflect the doubling
times for cells inoculated into 96-well plates at the screening inoculation
densities and grown in RPMI 1640 medium supplemented with 5%
fetal bovine serum for 48 h. We measured cell populations using a
sulforhodamine B optical density assay. The doubling time constant k
was calculated using the equation $N/N_0 = e^{kt}$, where $N_0$ is the optical density
for control (untreated) cells at time zero, N is the optical density for control cells
after 48-h incubation, and t is 48 h. The same equation was then used with the
derived k to calculate the doubling time t by setting $N/N_0 = 2$. For a given cell
line, we obtained $N_0$ and N values by averaging optical densities (N>6,000)
obtained for each cell line for a year's screening. Data and experimental details
are available (http://dtp.nci.nih.gov).
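
The two-step calculation above (solve the exponential growth equation for k, then solve again for the time at which the population doubles) can be sketched as follows. The function name and argument names are hypothetical; only the equation and the 48-h interval come from the text.

```python
import math


def doubling_time(n0, n, t_hours=48.0):
    """Doubling time in hours from optical densities at time zero and at t.

    Step 1: N/N0 = exp(k * t)      ->  k = ln(N/N0) / t
    Step 2: set N/N0 = 2 and solve ->  t_double = ln(2) / k
    Assumes exponential growth over the interval and n > n0 > 0.
    """
    k = math.log(n / n0) / t_hours
    return math.log(2) / k
```

For example, a culture whose optical density quadruples in 48 h has undergone two doublings, so `doubling_time(0.5, 2.0)` evaluates to approximately 24 h.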

Preparation and hybridization of fluorescent labelled cDNA. For each 
comparative array hybridization, labelled cDNA was synthesized by reverse 
transcription from test cell mRNA in the presence of Cy5-dUTP, and from 
the reference mRNA with Cy3-dUTP, using the Superscript II reverse-transcription 
kit (Gibco-BRL). For each reverse transcription reaction, mRNA 
(2 µg) was mixed with an anchored oligo-dT (d-20T-d(AGC)) primer (4 
µg) in a total volume of 15 µl, heated to 70 °C for 10 min and cooled on ice. 
To this sample, we added an unlabelled nucleotide pool (0.6 µl; 25 mM 
each dATP, dCTP, dGTP, and 15 mM dTTP), either Cy3 or Cy5 conjugated 
dUTP (3 µl; 1 mM; Amersham), 5x first-strand buffer (6 µl; 250 mM Tris-HCl, 
pH 8.3, 375 mM KCl, 15 mM MgCl2), 0.1 M DTT (3 µl) and 2 µl of 
Superscript II reverse transcriptase (200 U/µl). After a 2-h incubation at 42 
°C, the RNA was degraded by adding 1 N NaOH (1.5 µl) and incubating at 
70 °C for 10 min. The mixture was neutralized by adding 1 N HCl (1.5 
µl), and the volume brought to 500 µl with TE (10 mM Tris, 1 mM EDTA). 
We added Cot1 human DNA (20 µg; Gibco-BRL), and purified the probe 
by centrifugation in a Centricon-30 micro-concentrator (Amicon). The 
two separate probes were combined, brought to a volume of 500 µl, and 
concentrated again to a volume of less than 7 µl. We added poly(A) RNA 
(1 µl of 10 µg/µl; Sigma) and tRNA (10 µg/µl; Gibco-BRL), 
and adjusted the volume to 9.5 µl with distilled water. For final probe 
preparation, 20xSSC (2.1 µl; 1.5 M NaCl, 150 mM NaCitrate, pH 8.0) and 
10% SDS (0.35 µl) were added to a total final volume of 12 µl. The probes 
were denatured by heating for 2 min at 100 °C, incubated at 37 °C for 
20-30 min, and placed on the array under a 22 mm x 22 mm glass coverslip. 
We incubated slides overnight at 65 °C for 14-18 h in a custom slide chamber 
with humidity maintained by a small reservoir of 3xSSC. Arrays were 
washed by submersion and agitation for 2-5 min in 2xSSC with 0.1% SDS, 
followed by 1xSSC and then 0.1xSSC. The arrays were "spun dry" by centrifugation 
for 2 min in a slide-rack in a Beckman GS-6 tabletop centrifuge 
in Microplus carriers at 650 r.p.m. 
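
The volumes given for the final probe mix can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the variable names are ours, the volumes are assumed to be simply additive, and the stock labels (20x SSC, 10% SDS) are taken at face value.

```python
# Components of the final probe mix, in microlitres, as listed in the text.
probe_mix_ul = {
    "concentrated probe (adjusted with water)": 9.5,
    "20x SSC": 2.1,
    "10% SDS": 0.35,
}

total_ul = sum(probe_mix_ul.values())

# Final concentrations by simple dilution: C_final = C_stock * V_stock / V_total.
final_ssc_x = 20 * probe_mix_ul["20x SSC"] / total_ul
final_sds_pct = 10 * probe_mix_ul["10% SDS"] / total_ul

print(f"total volume: {total_ul:.2f} ul")
print(f"final SSC:    {final_ssc_x:.2f}x")
print(f"final SDS:    {final_sds_pct:.2f}%")
```

Under these assumptions the components sum to 11.95 µl, in line with the stated final volume of about 12 µl.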

Array quantitation and data processing. Following hybridization, arrays 
were scanned using a laser-scanning microscope (ref. 17; 
http://cmgm.stanford.edu/pbrown). Separate images were acquired for Cy3 and Cy5. We 
carried out data reduction with the program ScanAlyze (M.B.E., available 
at http://rana.stanford.edu/software). Each spot was defined by manual 
positioning of a grid of circles over the array image. For each fluorescent 
image, the average pixel intensity within each circle was determined, and a 
local background was computed for each spot equal to the median pixel 
intensity in a square of 40 pixels in width and height centred on the spot 
centre, excluding all pixels within any defined spots. Net signal was deter- 
mined by subtraction of this local background from the average intensity 
for each spot. Spots deemed unsuitable for accurate quantitation because 
of array artefacts were manually flagged and excluded from further analy- 
sis. Data files generated by ScanAlyze were entered into a custom database 
that maintains web-accessible files. Signal intensities between the two fluo- 
rescent images were normalized by applying a uniform scale factor to all 
intensities measured for the Cy5 channel. The normalization factor was 
chosen so that the mean log(Cy3/Cy5) for a subset of spots that achieved a 
minimum quality parameter (approximately 6,000 spots) was 0. This effectively 
defined the signal-intensity-weighted "average" spot on each array to 
have a Cy3/Cy5 ratio of 1.0. 
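
The background subtraction and channel normalization described above can be sketched as follows. This is a minimal illustration, not the ScanAlyze implementation: the function names and example intensities are hypothetical, the spots are assumed to have already passed the quality filter, and the sketch uses a plain (unweighted) mean of log ratios. Requiring the mean of log(Cy3/(s x Cy5)) to be 0 gives a scale factor s equal to the geometric mean of the raw Cy3/Cy5 ratios.

```python
import math

def net_signal(mean_intensity, local_background):
    """Net signal: average pixel intensity in the spot minus local background."""
    return mean_intensity - local_background

def cy5_scale_factor(cy3, cy5):
    """Scale factor s for Cy5 such that mean(log(cy3 / (s * cy5))) == 0.

    Solving for s gives s = exp(mean(log(cy3 / cy5))), i.e. the geometric
    mean of the raw Cy3/Cy5 ratios over the selected spots.
    """
    logs = [math.log(g / r) for g, r in zip(cy3, cy5)]
    return math.exp(sum(logs) / len(logs))

# Hypothetical (mean intensity, local background) pairs for three spots.
cy3_raw = [(1200.0, 200.0), (900.0, 100.0), (2300.0, 300.0)]
cy5_raw = [(700.0, 200.0), (500.0, 100.0), (800.0, 300.0)]

cy3 = [net_signal(m, b) for m, b in cy3_raw]
cy5 = [net_signal(m, b) for m, b in cy5_raw]

s = cy5_scale_factor(cy3, cy5)
cy5_norm = [s * r for r in cy5]   # uniform scale factor applied to Cy5

mean_log_ratio = sum(math.log(g / r) for g, r in zip(cy3, cy5_norm)) / len(cy3)
print(f"scale factor: {s:.4f}")
print(f"mean log(Cy3/Cy5) after normalization: {mean_log_ratio:.2e}")
```

The sketch assumes all net signals are positive; spots with non-positive net signal would have to be excluded before taking logarithms.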

Cluster analysis. We extracted tables (rows of genes, columns of individual 
microarray hybridizations) of normalized fluorescence ratios from the data- 
base. Various selection criteria, discussed in relation to each data set, were 
applied to select subsets of genes from the 9,703 cDNA elements on the 
arrays. Before clustering and display, the logarithms of the measured fluorescence 
ratios for each gene were centred by subtracting the arithmetic mean of 
all log ratios measured for that gene. The centring makes all subsequent analyses 
independent of the amount of each gene's mRNA in the reference pool. 
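
A short sketch shows why centring removes the dependence on the reference pool. Changing the amount of a gene's mRNA in the reference multiplies every ratio for that gene by the same constant, which shifts all its log ratios by the same amount, and subtracting the mean cancels that shift. The function name, the example ratios and the choice of base-2 logarithms are assumptions made for illustration.

```python
import math

def centre_log_ratios(ratios):
    """Return log2 ratios for one gene, centred on their arithmetic mean."""
    logs = [math.log2(r) for r in ratios]
    mean = sum(logs) / len(logs)
    return [x - mean for x in logs]

# Hypothetical ratios for one gene across four hybridizations.
ratios = [0.5, 1.0, 2.0, 4.0]

# The same gene measured against a reference pool containing 8-fold less of
# its mRNA: every ratio is multiplied by the same constant.
ratios_other_reference = [8.0 * r for r in ratios]

a = centre_log_ratios(ratios)
b = centre_log_ratios(ratios_other_reference)
print(a)
print(b)
```

Both calls yield the same centred profile, so the clustering that follows sees only the variation across samples.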

We applied a hierarchical clustering algorithm separately to the cell lines 
and genes using the Pearson correlation coefficient as the measure of similarity 
and average linkage clustering (refs 3,19-21). The results of this process are 
two dendrograms (trees), one for the cell lines and one for the genes, in 
which very similar elements are connected by short branches, and longer 
branches join elements with diminishing degrees of similarity. For visual 
display the rows and columns in the initial data table were reordered to 
conform to the structures of the dendrograms obtained from the cluster 
analysis. Each cell in the cluster-ordered data table was replaced by a graded 
colour (pure red through black to pure green), representing the mean- 
adjusted ratio value in the cell. Gene labels in cluster diagrams are dis- 
played here only for genes that were represented in the microarray by 
sequence-verified cDNAs. A complete software implementation of this 
process is available (http://rana.stanford.edu/software), as well as all clus- 
tering results (http://genome-www.stanford.edu/nci60). 
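
The clustering step can be illustrated with a small, self-contained sketch. It is not the software referred to above: it is a plain-Python version of agglomerative clustering with the Pearson correlation as similarity and textbook average linkage (mean pairwise similarity between clusters), and the gene labels and profiles are hypothetical. The returned leaf order is what would be used to reorder the rows (or columns) of the data table for display.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(profiles):
    """Agglomerative clustering; returns (tree, leaf order, merge log).

    profiles maps a label to its vector of centred log ratios. The similarity
    between two clusters is the mean Pearson correlation over all pairs of
    members, one taken from each cluster.
    """
    sim = {frozenset(p): pearson(profiles[p[0]], profiles[p[1]])
           for p in combinations(profiles, 2)}
    # Each cluster is (nested-tuple tree, list of member labels).
    clusters = [(label, [label]) for label in profiles]
    merges = []

    def cluster_similarity(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two most similar clusters.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_similarity(clusters[ij[0]],
                                                     clusters[ij[1]]))
        c1, c2 = clusters[i], clusters[j]
        merges.append((c1[0], c2[0], cluster_similarity(c1, c2)))
        merged = ((c1[0], c2[0]), c1[1] + c2[1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0], clusters[0][1], merges

# Hypothetical centred log-ratio profiles for four genes in four samples.
genes = {
    "gene_A": [-1.5, -0.5, 0.5, 1.5],
    "gene_B": [-1.2, -0.6, 0.4, 1.4],
    "gene_C": [1.0, 0.5, -0.5, -1.0],
    "gene_D": [1.3, 0.2, -0.4, -1.1],
}
tree, order, merges = average_linkage(genes)
print(tree)
print(order)
```

In this toy example the two pairs of co-varying genes are joined first, at high correlation, and the two pairs are joined last, at a negative average correlation, which corresponds to short and long branches respectively in a dendrogram.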

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 

DECLARATION OF JOHN C. ROCKETT, Ph.D. 
UNDER 37 C.F.R. § 1.132 

I, JOHN COUGHLIN ROCKETT III, Ph.D., declare and 
state as follows: 

1. Since 1995 I have been engaged full-time in 
molecular toxicology research, with an emphasis on the 
application of expression profiling techniques, including but 
not limited to nucleic acid microarray expression profiling 
techniques, to studies of the mechanisms of toxicant action 
and to the design of assays to monitor toxicant exposure. 

2. My curriculum vitae, including my list of 
publications, is attached hereto as Exhibit A. 

3. For the past 5 years, my work has focused 
primarily on analyzing the effects of potentially hazardous 
environmental agents, such as heat, water disinfectant 
byproducts, and conazole fungicides on the male reproductive 
tract. Although we are interested in the basic mechanisms of 
action of such toxicants, we also have two practical goals in 
mind: first, to identify individual agents and families of 
agents that adversely affect male reproductive development and 
function, and second, to develop methods for monitoring human 
exposure to such agents, particularly methods capable of 
identifying toxicant exposure at an early stage. 

4. I have relied on expression profiling as a 
principal approach to these goals. Expression profiling, by 
reporting the expression levels of thousands of genes 
simultaneously, gives us an opportunity to identify and group 
toxicants based on similarities in the patterns of gene 
expression they induce in cells and tissues; the gene 
expression profiles induced by treatment with known testicular 
toxins serve as standards, molecular signatures or molecular 
fingerprints as it were, against which the patterns of gene 
expression induced by agents of unknown toxicity may be 
compared and judged. In addition, gene expression profiling 
may give us the opportunity to detect toxicity before more 
gross phenotypic changes become manifest. 

5. In keeping with this research emphasis, I have 
until recently: 

served on the Microarray Technical 
Subcommittee of the United States Environmental 
Protection Agency (EPA) Genomics Task Force, and 

served on the Scientific Committee for 
the conference series on "Critical Assessment of 
Techniques for Microarray Data Analysis," held 
annually at Duke University, Durham, NC; 

and I currently 

serve on the Technical Committee on the 
Application of Genomics to Mechanism-Based Risk 
Assessment of the International Life Sciences 
Institute's Health and Environmental Sciences 
Institute, 

serve on the Genomics and Proteomics 
Committee of the National Health and Environmental 
Effects Research Laboratory of the EPA's Office of 
Research and Development, 

belong to the [North Carolina Research] 
Triangle Array Users Group, 



belong to the Molecular Biology 
Speciality Section of the Society of Toxicology, 
and 

belong to the Triangle Consortium for 
Reproductive Biology. 

In addition, I am the principal investigator on a cooperative 
research and development agreement (CRADA) entitled 
"Development of a Genetic Test for Male Factor Infertility." 
Prior to this, I was a co-principal investigator on a 
materials cooperative research and development agreement 
(MCRADA) to print oligonucleotide-based microarrays; and from 
1999 - 2002, I was coinvestigator on a CRADA to develop gene 
microarrays for toxicology applications. 

6. I presume the reader's familiarity with the 
basic construction and operation of microarrays. For purposes 
of the discussion to follow, I use the phrase "nucleic acid 
microarray" and, equivalently, the term "microarray" to refer 
generically to the various types of nucleic acid microarray 
that include immobilized nucleic acid probes of sufficient 
length to permit specific binding, with minimal cross- 
hybridization, to the probe's cognate transcript, whether the 
transcript is in the form of RNA or DNA. Although this 
definition excludes microarrays having shorter probes, such as 
the 20-mer probes of arrays manufactured by Affymetrix, Inc., 
many of the comments that follow nonetheless apply to such 
microarrays as well. 

7. Although my own work with microarrays dates 
back only to 1998, and high density spotted nucleic acid 
microarrays themselves date back perhaps only to 1995, 1 
microarrays are by no means the only, nor the first, 
expression profiling tool. As I describe in detail in my 
Xenobiotica review, 2 there are a number of other differential 
expression analysis technologies that precede the development 
of microarrays, some by decades, and that have been applied to 
drug metabolism and toxicology research, including: 
(1) differential screening; (2) subtractive hybridization, 
including variants such as chemical cross-linking subtraction, 
suppression-PCR subtractive hybridization and representational 
difference analysis; (3) differential display; (4) restriction 
endonuclease facilitated analyses, including serial analysis 
of gene expression (SAGE) and gene expression fingerprinting; 
and (5) EST analysis. 

8. In my own earlier research, I used both 
reverse-transcriptase polymerase chain reaction (RT-PCR) and 
suppression-PCR subtractive hybridization (SSH) to study 
patterns of differential gene expression caused by hepatic 
challenge with nongenotoxic and genotoxic hepatotoxins. 3 



1 Schena et al., "Quantitative monitoring of gene expression patterns 
with a complementary DNA microarray," Science 270:467-470 (1995), attached 
hereto as Exhibit B. 

2 Rockett et al., "Differential gene expression in drug metabolism and 
toxicology: practicalities, problems and potential," Xenobiotica 29:655-691 
(1999) (hereinafter, "Xenobiotica review"), attached hereto as Exhibit C. 

3 See, e.g., Rockett et al., "Molecular profiling of non-genotoxic 
carcinogenesis using differential display reverse transcription polymerase 
chain reaction (ddRT-PCR)," European J. Drug Metabolism & Pharmacokinetics 
22(4):329-33 (1997), and Rockett et al., "Use of a suppression-PCR 
subtractive hybridization method to identify gene species which demonstrate 
altered expression in male rat and guinea pig livers following 3-day 
exposure to [4-chloro-6-(2,3-xylidino)-2-pyrimidinylthio]acetic acid," 
Toxicology 144(1-3):13-29 (2000), attached hereto respectively as Exhibits 
D and E. 



9. These older transcript expression profiling 
techniques provide analogous expression data, but with far 
lower throughput. 

10. It has been well-established, at least since 
the introduction of high density spotted microarrays in 1995, 
that: 

(i) each probe on the microarray, with 
careful design and sufficient length, and with 
sufficiently stringent hybridization and wash 
conditions, binds specifically and with minimal 
cross-hybridization, to the probe's cognate 
transcript; 

(ii) each additional probe makes an 
additional transcript newly detectable by the 
microarray, increasing the detection range, and 
thus versatility, of this analytical device for 
gene expression profiling; 4 

(iii) it is not necessary that the 
biological function be known in order for the gene, 



4 The compelling logic of this proposition has likely motivated the 
remarkably rapid progress from the earliest high density spotted arrays in 
1995 (Schena et al., "Quantitative monitoring of gene expression patterns 
with a complementary DNA microarray," Science 270:467-470 (1995), attached 
hereto as Exhibit B), to the first whole genome arrays in 1997 (Lashkari et 
al., "Yeast microarrays for genome wide parallel genetic and gene 
expression analysis," Proc. Natl. Acad. Sci. USA 94(24):13057-62 (1997) and 
DeRisi et al., "Exploring the metabolic and genetic control of gene 
expression on a genomic scale," Science 278(5338):680-6 (1997), attached 
hereto as Exhibits F and G, respectively), to the concurrent announcement 
by two companies earlier this month of their respective commercial 
introductions of single chip human whole genome arrays (Pollack, "Human 
Genome Placed on Chip; Biotech Rivals Put it Up for Sale," The New York 
Times, Thursday, October 2, 2003 (Business Day), attached hereto as 
Exhibit H; "Agilent Technologies ships whole human genome on single 
microarray to gene expression customers for evaluation," Press Release, 
Agilent Technologies, October 2, 2003, attached hereto as Exhibit I; 
"Affymetrix Announces Commercial Launch of Single Array for Human Genome 
Expression Analysis; More Than 1 Million Probes Analyze Expression Levels 
of Nearly 50,000 RNA Transcripts and Variants on a Single Array the Size of 
a Thumbnail," Press Release, Affymetrix, October 2, 2003, attached hereto 
as Exhibit J). 



or a fragment of the gene, to prove useful as a 
probe on a microarray to be used for expression 
analysis; 

(iv) failure of a probe to detect changes 
in expression of its cognate gene does not diminish 
the usefulness of the probe on the microarray; and 

(v) failure of a probe to detect a 
particular transcript in any single experiment does 
not deprive the probe of usefulness to the 
community of users who would use this research 
tool. 

These principles also apply to transcript expression profiling 
techniques that antedate the development of high density 
spotted microarrays, and accordingly were well-understood 
prior to 1995. 

11. Moreover, expression profiling is not limited 
to the measurement of mRNA transcript levels. It is widely 
understood among molecular and cellular biologists that 
protein expression levels provide complementary profiles for 
any given cell and cellular state. Although I cannot claim 
credit for having coined the phrase, I have written that the 
difference between transcript expression profiling and protein 
expression profiling is that "transcriptomics indicates what 
should happen, and proteomics shows what is happening." 5 

12. For decades, such protein expression profiles 
have been generated using two dimensional polyacrylamide gel 



5 Rockett, "Macroresults through Microarrays," Drug Discovery Today 
7:804-805 (2002) (emphasis added), attached hereto as Exhibit K. 



electrophoresis (2D-PAGE), and used, among other things, to 
study drug effects. 6 



13. Although the protein expression profiles 
produced by 2D-PAGE analysis are analogous to the transcript 
expression profiles provided by nucleic acid microarrays, an 
even closer analogy is perhaps offered by antibody 
microarrays; as I note in my Drug Discovery Today commentary, 
such antibody microarrays date back to the work of Roger Ekins 
in the mid- to late-1980s. 7 

14. The principles in paragraph 10 also apply to 
protein expression profiling analyses, particularly to 
analyses performed using antibody microarrays. Thus, as with 
nucleic acid microarrays, the greater the number of proteins 
detectable, the greater the power of the technique; the 
absence or failure of a protein to change in expression levels 
does not diminish the usefulness of the method; and prior 
knowledge of the biological function of the protein is not 
required. As applied to protein expression profiling, these 
principles have been well understood since at least as early 
as the 1980s. 

15. Both gene and protein expression profiling are 
particularly useful to the toxicologist, especially in the 
pharmaceutical industry. Accordingly, I made the following 



6 See, e.g., Anderson et al., "A two-dimensional gel database of rat 
liver proteins useful in gene regulation and drug effects studies," 
Electrophoresis 12:907-930 (1991), attached hereto as Exhibit L. 

7 See Ekins et al., J. Bioluminescence Chemiluminescence 5:59-78 
(1989); Ekins et al., Clin. Chem. 37:1955-1965 (1991); and Ekins, U.S. 
Patent Nos. 5,432,099, 5,807,755, and 5,837,551, attached hereto 
respectively as Exhibits M to Q. 



statements in my Xenobiotica review, written in the summer of 
1998: 

[I]n the field of chemical-induced 
toxicity, it is now becoming increasingly obvious 
that most adverse reactions to drugs and chemicals 
are the result of multiple gene regulation, some of 
which are causal and some of which are casually- 
related to the toxicological phenomenon per se. 
This observation has led to an upsurge in interest 
in gene-profiling technologies which differentiate 
between the control and toxin-treated gene pools in 
target tissues and is, therefore, of value in 
rationalizing the molecular mechanisms of 
xenobiotic-induced toxicity. 

Knowledge of toxin-dependent gene 
regulation in target tissues is not solely an 
academic pursuit as much interest has been 
generated in the pharmaceutical industry to harness 
this technology in the early identification of 
toxic drug candidates, thereby shortening the 
developmental process and contributing 
substantially to the safety assessment of new 
drugs. 

For example, if the gene profile in 
response to say a testicular toxin that has been 
well-characterized in vivo could be determined in 
the testis, then this profile would be 
representative of all new drug candidates which act 
via this specific molecular mechanism of toxicity, 
thereby providing a useful and coherent approach to 
the early detection of such toxicants. 

Whereas it would be informative to know 
the identity and functionality of all genes up/down 
regulated by such toxicants, this would appear a 
longer term goal, as the majority of human genes 
have not yet been sequenced, far less their 
functionality determined. However, the current use 
of gene profiling yields a pattern of gene changes 
for a xenobiotic of unknown toxicity which may be 
matched to that of well-characterized toxins, thus 
alerting the toxicologist to possible in vivo 
similarities between the unknown and the 
standard. ... 



Despite the development of multiple 
technological advances which have recently brought 
the field of gene expression profiling to the 
forefront of molecular analysis, recognition of the 
importance of differential gene expression and 
characterization of differentially expressed genes 
has existed for many years. 



16. As noted in the preceding excerpt from my 
Xenobiotica review, expression profiling in toxicology studies 
yields patterns of changes that are characteristic of an agent 
of unknown toxicity, which patterns may usefully be matched to 
those of well-characterized toxins. 

17. In the context of such patterns of gene 
expression, each additional gene-specific probe provides an 
additional signal that could not otherwise have been detected, 
giving a more comprehensive, robust, higher resolution and 
thus more useful pattern than otherwise would have been 
possible. 8 

18. It is my opinion, therefore, based on the state 
of the art in toxicology at least since the mid-1990s (and, 
as regards protein profiling, even earlier), that disclosure 
of the sequence of a new gene or protein, with or without 
knowledge of its biological function, would have been 



8 In a sense, each gene-specific probe used in such an analysis is 
analogous to a different one of the many parts of an engine, with each 
individual part, or subcombinations of such parts, deriving at least part 
of their usefulness from the utility of the completed combination, the 
functioning engine. 



sufficient information for a toxicologist to use the gene 
and/or protein in expression profiling studies in toxicology. 

19. The statements made in this declaration 
represent my individual views and are not intended to 
represent the opinion of my employer, the United States 
Environmental Protection Agency, or of any other branch of the 
federal government. Other than my current engagement to 
provide this declaration, I have neither had, nor currently 
have, financial ties to, or financial interest in, Incyte 
Corporation. I am not myself an inventor on any patent 
application claiming a gene or gene fragment. 

20. I declare further that all statements made 
herein of my own knowledge are true and that all statements 
made on information and belief are believed to be true, and 
further that these statements were made with the knowledge 
that willful false statements and the like so made are 
punishable by fine or imprisonment, or both, under 
Section 1001 of Title 18 of the United States Code and may 
jeopardize the validity of any patent application in which 
this declaration is filed or any patent that issues thereon. 



Date 

PERSONAL DETAILS 



Name: John Coughlin Rockett III 

Nationality: USA 

Work Address: United States Environmental Protection Agency 
National Health and Environmental Effects Research Laboratory 
Reproductive Toxicology Division (MD-72) 
Gamete and Early Embryo Biology Branch 
Research Triangle Park 
NC 27711 
USA 

Work Telephone: +001 (919) 541 2678 

Work Fax: +001 (919) 541 4017 

Work E-mail: rockett.john@epa.gov 



Employment and Higher Education 



CURRENT POSITION (12/00-present) 
Research Biologist 

Gamete and Early Embryo Biology Branch (MD-72) 
Reproductive Toxicology Division 

National Health and Environmental Effects Research Laboratory 

US Environmental Protection Agency 

Research Triangle Park 

NC 27711 

USA 

PREVIOUS POSITIONS 

8/98-12/00: NHEERL Post-Doctoral Research Fellow, Gamete and Early Embryo Biology 
Branch, Reproductive Toxicology Division, National Health and Environmental Effects 
Research Laboratory, United States Environmental Protection Agency, Research Triangle Park, 
NC, USA. 

Supervisors: Dr Sally P. Darney (Scientific publications under Sally D. Perreault) and Dr David 
J. Dix. 

5/95-7/98: Rhone-Poulenc Post-Doctoral Research Fellow, Molecular Toxicology Group, School 
of Biological Sciences, University of Surrey, Guildford, Surrey, England. 
Supervisor: Prof. G. Gordon Gibson. 

EDUCATION 

Ph.D., 1995 - University of Warwick, Coventry, W. Midlands, England 

Title: Transforming Growth Factor-β1 and Immune Recognition Molecules in Oesophageal Cancer. 

Supervisors: Dr Alan G. Morris (University of Warwick) and Dr S. Jane Darnton (Birmingham 
Heartlands Hospital) 

B.Sc. (Hons.), 1991 - University of Warwick, Coventry, W. Midlands, England. 

Degree: Microbiology and Microbial Technology (with intercalated year in industry), Class 2i. 

Tutor: Professor Howard Dalton. 



PROFESSIONAL ACTIVITIES 



Membership of Professional Societies: 

Society of Toxicology (Inc. Molecular Biology Speciality Section) (2001-present) 
Science Advisory Board (2001-present) 

North Carolina Chapter of the Society of Toxicology (1999-present) 

Triangle Consortium for Reproductive Biology (1999-present) 

Triangle Array Users Group (1999-present) 

Institute of Biology (U.K.) (1989 - present) 

British Toxicology Society (1996 - 2000) 

Biochemical Society (U.K.) (1992-1995) 

British Society for Immunology (1992-1995) 

Membership of Scientific Committees: 

International Life Sciences Institute's (ILSI) Health and Environmental Sciences Institute (HESI) 
Technical Committee on the Application of Genomics to Mechanism-Based Risk Assessment: 

• Steering Committee (5/02-present). 

• Hepatotoxicity Working Group Vice-Chair (5/02-present). 

• Hepatotoxicity Work Group Member (5/01-present). 

Charter member, Fertility and Early Pregnancy Work Group of the National Children's Study 
(07/01-Present). 

National Health and Environmental Effects Research Laboratory Distinguished Lecture Series 
Committee (July 03-present). 

U.S. Environmental Protection Agency Genomics Task Force Microarray Technical 
Subcommittee (August 03-present). 

National Health and Environmental Effects Research Laboratory Genomics and Proteomics 
Committee (NGPC) (September 03-present). 



Professional Meetings: 

Invited participant ("Observer") in Expert Panel Workshop: "The Role of Environmental Factors 
on the Onset and Progression of Puberty in Children". Organised by Serono Symposia 
International. November 6th-8th, 2003, Chicago, IL, USA. 

Joint organiser and co-chair of: "Genomic analysis of surrogate tissues for measuring toxic 
exposures and drug action", the "Innovations in Applied Toxicology" Symposium for the Society 
of Toxicology 42nd Annual Meeting, March 9th-13th, 2003, Salt Lake City, UT, USA. 

(8) John C. Rockett, David J. Esdaile and G Gordon Gibson (1999). Differential gene expression 
in drug metabolism: practicalities, problems and potential. Xenobiotica, 29(7):655-691. 
(7) MC Murphy, CN Brookes, JC Rockett, C Chapman, JA Lovegrove, BJ Gould, JW Wright and 
CM Williams (1999). The quantitation of lipoprotein lipase mRNA in biopsies of human adipose 
tissue, using the polymerase chain reaction, and the effect of increased consumption of n-3 
polyunsaturated fatty acids. European Journal of Clinical Nutrition, 53:441-447. 

(6) JC Rockett, DJ Esdaile and GG Gibson (1997). Molecular profiling of non-genotoxic 
carcinogenesis using differential display reverse transcription polymerase chain reaction (ddRT- 
PCR). European Journal of Drug Metabolism & Pharmacokinetics 22(4):329-33. 

(5) Rockett, J., Larkin, K., Darnton, S., Morris, A. and Matthews, H. (1997). Five newly 
established oesophageal carcinoma cell lines: phenotypic and immunological characterisation. 
British Journal of Cancer 75(2):258-263. 

(4) J C Rockett, S J Darnton, J Crocker, H R Matthews and A G Morris (1996). Lymphocyte 
infiltration in oesophageal carcinoma: lack of correlation with MHC antigens, ICAM-1, and tumour 
stage and grade. Journal of Clinical Pathology 49:264-267. 

(3) J C Rockett, S J Darnton, J Crocker, H R Matthews and A G Morris (1995). Expression of HLA-ABC 
and HLA-DR histocompatability antigens and intercellular adhesion molecule-1 in 
oesophageal carcinoma. Journal of Clinical Pathology 48:539-44. 

(2) Salam M, Rockett J and Morris A (1995). The prevalence of different human papillomavirus 
types and p53 mutations in laryngeal carcinomas: is there a reciprocal relationship? European 
Journal of Surgical Oncology 21:290-296. 

(1) Salam M, Rockett J and Morris A (1995). General primer-mediated polymerase chain reaction 
for simultaneous detection and typing of HPV in laryngeal carcinomas. Clinical Otolaryngology 
20:84-88. 

(2) Articles Submitted To A Scientific Journal 

(4) John C. Rockett, Judith E. Schmid, Christopher J. Luft, J. Brian Garges, M. Stacey Ricci, 
Pasquale Patrizio, Norman B. Hecht and David J. Dix. Gene Expression Patterns Associated with 
Infertility in Rodent and Human Models. * An invited submission* 

(3) Roger Ulrich, John C. Rockett, G. Gordon Gibson and Syril Pettit. Evaluating the Effects of 
Methapyrilene and Clofibrate on Hepatic Gene Expression: A Collaboration Between Laboratories 
and a Comparison of Platform and Analytical Approaches. 

(2) Valerie A Baker, Helen M Harries, Jeffrey F Waring, Roger Jolly, Angus de Souza, Judith E 
Schmid, Hong Ni, Roger Brown, Roger G Ulrich and John C. Rockett. Clofibrate-Induced Gene 
Expression Changes in Rat Liver: A Cross-Laboratory Analysis Using Membrane cDNA Arrays. 

(1) David Miller, Corrado Spadafora, David Dix, Adrian Platts, John C. Rockett, Stephen A 
Krawetz. Nuclease digestion of sperm chromatin suggests a random distribution of gene sequences. 

(3) Articles In Preparation For Submission To A Scientific Journal 

(3) Spearow J, DB Tully, JC Rockett and DJ Dix. Differential testicular gene expression in mouse 
strains sensitive and resistant to endocrine disruption by estrogen. 

(2) Sally D. Perreault, John C. Rockett, Laura Fenster, James Kesner, Wendy Robbins and Steven 
Schrader. Biomarkers for Assessing Reproductive Development and Health: Part 2 - Adult 
Reproductive Health. 

(1) J. Christopher Luft, Douglas B. Tully, John C. Rockett, Judith E. Schmid and 
David J. Dix. Reproductive and genomic effects in testes from mice exposed to the water 
disinfectant byproduct bromochloroacetic acid 

(4) Book Chapters 

(4) John C. Rockett. Gene Microarrays Applied to Reproductive Toxicology. In Cunningham 
(Ed): Genetic and Proteomic Applications in Toxicity Testing, The Humana Press, Totowa. In 
Preparation. * An invited submission* 

(3) John C. Rockett and David J Dix. Gene Expression Networks. In Cooper (ed-in-chief): 
Encyclopaedia of the Human Genome, Nature Publishing Group. London, New York. ISBN 0-333- 
80386-8 (2003). * An invited submission* 

(2) John C. Rockett. The Future of Toxicogenomics. In Michael Burczynski (ed): "An 
Introduction to Toxicogenomics". CRC Press. Boca Raton, London, New York, Washington D.C., 
pp299-317 (2003). * An invited submission* 

(1) J. Rockett, S. Darnton, J. Crocker, H. Matthews and A. Morris: Major Histocompatibility 
Complex (MHC) class I and II and Intercellular Adhesion Molecule (ICAM)-1 expression in 
oesophageal carcinoma. Peracchia A, Rosati R, Bonavina L, Bona S, Chella B (eds): Recent 
Advances in Diseases of the Esophagus. Bologna: Monduzzi Editore, pp45-49 (1996). 

(5) Other Scientific Publications (Letters to Editors; Meeting Reports; Commentaries 
etc.) 

(11) John C Rockett (2003). Probing the nature of microarray-based oligonucleotides. Drug 
Discovery Today 8(9):389. (A Letter To The Editor) * An invited submission* 

(10) John C. Rockett (2003). To confirm or not to confirm (microarray data) - that is the question. 
Drug Discovery Today 8(8):343. (A Letter To The Editor) 

(9B) Nazzareno Ballatori, James L. Boyer, and John C. Rockett. (2003). Exploiting Genome Data 
to Understand the Function, Regulation and Evolutionary Origins of Toxicologically Relevant 
Genes. Environ Health Perspect. 111(6):871-5. (A Meeting Report) 

(9A) Nazzareno Ballatori, James L. Boyer, and John C Rockett. (2003). Exploiting Genome Data 
to Understand the Function, Regulation and Evolutionary Origins of Toxicologically Relevant 
Genes. EHP Toxicogenomics. 111(1T):61-5. (A Meeting Report) 

(8) John C. Rockett (2002). Surrogate Tissue Analysis for Monitoring the Degree and Impact of 
Exposures in Agricultural Workers. AgBiotechNet, 4:1-7 November, ABN 100. (A Review Article), 
* An invited submission * 

(7) John C. Rockett (2002). Macroresults Through Microarrays. Drug Discovery Today, 7(15):804-805. 
(A Meeting Report) 

(6) John C. Rockett (2002). Chip, chip, array! Three chips for post-genomic research. Drug 
Discovery Today, 7(8):458-459. (A Meeting Report) 

(5) John C Rockett (2002). Use of Genomic Data in Risk Assessment. GenomeBiology, 3(4): 
reports4011.1-4011.3 (http://genomebiology.com/2002/3/4/reports/4011/?isguard=1). (A Meeting 
Report) 

(4) John C. Rockett (2001). Genomic and Proteomic Techniques Applied to Reproductive 
Biology. GenomeBiology 2(9): 4020.1-4020.3 (http://genomebiology.com/2001/2/9/reports/4020/). 
(A Meeting Report) 

(3) John C. Rockett (2001). Chipping away at the mystery of drug responses. The 
Pharmacogenomics Journal, 1(3):161-163. (A commentary) * An invited submission* 

(2) Rockett, John C. and Dix, David J. (1999). U.S. EPA workshop: Application of DNA arrays to 
Toxicology. Environmental Health Perspectives, 107(8):681-685. (A Meeting Report) 

(1) John C. Rockett III (1995). Immune recognition molecules and transforming growth factor 
beta-1 in oesophageal cancer. Ph.D. thesis, University of Warwick, Coventry, England. (Ph.D. 
thesis) 

(6) Published Book, Paper and Website reviews 

(9) John C. Rockett (2002). A report on the manuscript: Systemic RNAi in C. elegans requires the 
putative transmembrane protein SID-1. Winston WM, Molodowitch C, Hunter CP. Science. 2002 
295:2456-2459. GenomeBiology, 3(7):reports0034 
http://genomebiology.com/2002/3/7/reports/0034/ 

(8) John C. Rockett (2001). A report on the manuscript: Genetic rescue of an endangered mammal 
by cross-species nuclear transfer using post-mortem somatic cells. P Loi, et al., Nat Biotechnol. 
2001, 19:962-964. GenomeBiology, 3(1):reports0006. 
(http://genomebiology.com/2001/3/1/reports/0006/). 

(7) John C. Rockett (2001). A report on the manuscript: Molecular Classification of Human 
Carcinomas by Use of Gene Expression Signatures. A Su et al., Cancer Res. 2001 61:7388-7393. 
GewoweBiology, 3(l):reports0005. (http://eenomebiologv.eom /2001/3/l/reports/0005/). 

(6) John C. Rockett (2001). A report on the manuscript: Genetic evidence for two species of 
elephant in Africa. A Roca et al., Science. 2001 Aug 24;293(5534): 1473-7. GenomeBiology, 
2(12):reports0045. rhttp://www.genomebiologv.com/2001/2/12/re ports/0045/-. 

(5) John C. Rockett (2001). A report on the manuscript: Extensive genetic polymorphism in the 
human CYP2B6 gene with impact on expression and function in human liver. T Lang et al., 
Pharmacogenetics, 2001, 11(5):399-415. GenomeBiology, 2(12):reports0044. 
fhttp://ww.genomebiologv.conV2001/2/12/reports/0044/) . 

(4) John C. Rockett (2001). A report on the manuscript: Novel Human Testis-Specific cDNA: 
' molecular Cloning, Expression and Immunological Effects of the Recombinant Protein. R 
Santhanam and R K Naz, Molecular Reproduction and Development 60:1-12 (2001). 
GewoweBiology, 2(ll):reports0040. fhttp://genomebiologv.com /2001/2/ll/reports/0040A. 

(3) John C. Rockett (2001). A report on the website: BIND - The Biomolecular Interaction 
Network Database (http://www.bind.ca/> . GewoweBiology, 2(9): reports201 1 . 
http://www.genomebiology.eom/2001/2/9/reports/2011/ . 

(2) John C. Rockett (2001). A report on the manuscript: Exploring the DNA-binding specificities 
of zinc fingers with DNA microarrays. ML Bulyk et al., Proc Natl Acad Sci USA 2001, 98:7158- 
7163. GenomeBiology, 2(10): reports0032. (http://genomebiologv.eom /2001/2/10/reports/0032/). 

(1) J Rockett (1996). A Book Review on: "Cell Adhesion and Cancer" (Eds., Hogg N. and Hart I.). 
Clinical Molecular Pathology 49(1):M64. * An invited submission* 

(7) Published Abstracts of Poster and Oral Presentations 

(17) Amber K. Goetz, Wenjun Bao, Judith E. Schmid, Carmen Wood, Hongzu Ren, Deborah S. 
Best, Rachel N. Murrell, John C. Rockett, Michael G. Narotsky, Douglas C. Wolf, Douglas B. 
Tully, David J. Dix. Gene Expression Profiling in Testis and Liver of Mice to Identify Modes of 
Action of Conazole Toxicities. Society of Toxicology 43rd Annual Meeting, March 21st-25th, 2004, 
Baltimore, MD, USA. Toxicological Sciences. (Submitted) 

(16) Jane Gallagher, Theresa Lehman, Ramakrishna Modali, Scott Rhoney, Marien Clas, Jeff 
Inmon, John C. Rockett, David Dix, Cindy Mamay, Suzanne Fenton, Suzanne McMaster, Stan 
Barone Jr, Pauline Mendola and Reeder Sams. Validation of Non-Invasive Biological Samples: 
Pilot Projects Relevant to the National Children Study. Society of Toxicology 43rd Annual Meeting, 
March 21st-25th, 2004, Baltimore, MD, USA. Toxicological Sciences. (Submitted) 

(15) B.S. Pukazhenthi, J.C. Rockett, M. Ouyang, D.J. Dix, J.G. Howard, P. Georgopoulos, W.J. 
Welsh and D.E. Wildt. Gene Expression In The Testis Of Normospermic Versus Teratospermic 
Domestic Cats Using Human cDNA Microarray Analyses. Society for the Study of Reproduction 
36th Annual Meeting, July 19th-22nd, 2003, Cincinnati, OH, USA. Biology of Reproduction 68 (Supp 
1):191. 

(14) David J. Dix and John C. Rockett (2003). Genomic and Proteomic Analysis of Surrogate 
Tissues for Assessing Toxic Exposures and Disease States. Innovation in Applied Toxicology 
symposium entitled "Genomic and Proteomic Analysis of Surrogate Tissues for Assessing Toxic 
Exposures and Disease States". Society of Toxicology 42nd Annual Meeting, March 9th-13th, 2003, 
Salt Lake City, UT, USA. Toxicological Sciences 72(S-1):276. 

(13) John C. Rockett, Chad R. Blystone, Amber K. Goetz, Rachel N. Murrell, Judith E. Schmid 
and David J. Dix (2003). Gene Expression Profiling Of Accessible Surrogate Tissues To Monitor 
Molecular Changes In Inaccessible Target Tissues Following Toxicant Exposure. Innovations in 
Applied Toxicology Symposium entitled "Genomic and Proteomic Analysis of Surrogate Tissues 
for Assessing Toxic Exposures and Disease States". Society of Toxicology 42nd Annual Meeting, 
March 9th-13th, 2003, Salt Lake City, UT, USA. Toxicological Sciences 72(S-1):276. 

(12) Douglas B. Tully, J. Christopher Luft, John C. Rockett, Judy E. Schmid and David J. Dix 
(2002). Effects on gene expression in testes from adult male mice exposed to the water disinfectant 
byproduct bromochloroacetic acid. Society for the Study of Reproduction 35th Annual Meeting, July 
28-31, 2002, Baltimore, Maryland, USA. Biology of Reproduction 66 (Supp 1):223. 

(11) David J. Dix, Kary E. Thompson, John C. Rockett, Judith E. Schmid, Robert J. Goodrich, 
David Miller, G. Charles Ostermeier and Stephen A. Krawetz (2002). Testis and spermatozoal RNA 
profiles of normal fertile men. Society for the Study of Reproduction 35th Annual Meeting, July 
28-31, 2002, Baltimore, Maryland, USA. Biology of Reproduction 66 (Supp 1):194. 

(10) Asa J. Oudes, John C. Rockett, David J. Dix and Kwan Hee Kim (2002). Identification of 
retinoic acid induced genes in mouse testis by cDNA microarray analysis. 27th Annual Meeting of 
the American Society of Andrology, 4/24-27/02. J. Andrology Supplement March/April. 

(9) John C. Rockett, Robert J. Kavlock, Christy Lambright, Louise G. Parks, Judith E. Schmid, 
Vickie S. Wilson and David J. Dix (2002). Use of DNA arrays to monitor gene expression in blood 
and uterus from Long-Evans rats following 17-β-estradiol exposure - a new approach to 
biomonitoring endocrine disrupting chemicals using surrogate tissues. Toxicological Sciences 66(1): 
Abstract No.1388. 

(8) David J. Dix and John C. Rockett (2002). Genomic analysis of the testicular toxicity of 
haloacetic acids. Platform presentation at the symposium, "Defining the cellular and molecular 
mechanisms of toxicant action in the testis". Toxicological Sciences 66 (1): Abstract No.848. 

(7) JC Rockett, JC Luft, JB Garges and DJ Dix (2001). The reproductive effects of the water 
disinfectant byproduct bromochloroacetate on juvenile and adult male mice. Toxicological 
Sciences, 60 (1):250. 

(6) Tarka DK, Klinefelter GR, Rockett JC, Suarez JD, Roberts NL and Rogers JM (2001). Effect 
of gestational exposure to ethane dimethane sulfonate (EDS), bromochloroacetic acid (BCA) and 
molinate on reproductive function in CD-1 male mice. Toxicological Sciences, 60 (1):250. 

(5) Garges JB, Rockett JC and Dix DJ (2001). Developmental and reproductive phenotype of mice 
lacking stress-inducible 70 kDa heat shock proteins (Hsp70s). Toxicological Sciences, 60 (1):383. 

(4) D Dix, J Rockett, J Luft, J Garges, M Ricci, P Patrizio and N Hecht (2000). Using DNA 
microarrays to characterise gene expression in testes of fertile and infertile humans and mice. 
Biology of Reproduction, 62 (s1):227. 

(3) J Luft, J B Garges, J Rockett and D Dix (2000). Male reproductive toxicity of 
bromochloroacetic acid in mice. Biology of Reproduction, 62 (s1):246. 

(2) Rockett, JC, Garges, JB and Dix, DJ (2000). A single heat-shock of juvenile male mice causes 
a long-term decrease in fertility and reduces embryo quality. Toxicological Sciences 54 (1):365. 

(1) JC Rockett, SJ Darnton, J Crocker, HR Matthews and AG Morris (1994). Major 
Histocompatibility (MHC) class I and II and intercellular adhesion molecule (ICAM)-1 expression 
in oesophageal carcinoma (OC). Immunology 83 (s1):64. 



(8) Invited Oral Presentations 

(10) John C. Rockett and Gary M Hellmann. To confirm or not to confirm (microarray data) - 
that is the question. Seminar for EPA/NHEERL Genomics and Proteomics Committee's ArrayQA 
forum, August 25th, 2003, RTP, NC, USA. 

(9) John C. Rockett. "Biomonitoring Toxicant Exposure and Effect Using Toxicogenomics and 
Surrogate Tissue Analysis". Seminar for Division of Epidemiology, Statistics and Prevention 
Research, National Institute of Child Health and Development, May 29th, 2003, Rockville, MD, 
USA. 

(8) John C. Rockett. "Genomics and Proteomics: New Toxicity Testing". Platform presentation at 
US EPA Regional Risk Assessors Annual Conference, April 28th - May 2nd, 2003, Stone Mountain, 
GA, USA. 

(7) John C. Rockett, Chad R. Blystone, Amber K. Goetz, Rachel N. Murrell, Judith E. Schmid and 
David J. Dix. "Gene Expression Profiling Of Accessible Surrogate Tissues To Monitor Molecular 
Changes in Inaccessible Target Tissues Following Toxicant Exposure." Platform presentation at 
SoT 42nd Annual Meeting symposium entitled "Genomic and Proteomic Analysis of Surrogate 
Tissues for Measuring Toxic Exposures and Drug Action", March 9th-13th, 2003, Salt Lake City, 
UT, USA. 

(6) John C. Rockett. "A Toxicogenomic Approach to Surrogate Tissue Analysis". Seminar for 
Department of Environmental and Molecular Toxicology, North Carolina State University, 
September 3rd, 2002, Raleigh, NC, USA. 

(5) John C. Rockett. "Differential gene expression in toxicology: practicalities, problems and 
potential". Platform presentation at 9th Annual Mount Desert Island Biological Laboratory 
Environmental Health Sciences Symposium: Exploiting Genome Data to Understand the Function, 
Regulation and Evolutionary Origins of Toxicologically Relevant Genes, July 10th-11th, 2002, 
Salisbury Cove, Maine, USA. 

(4) John C. Rockett, Leroy Folmar, Michael J. Hemmer and David J. Dix. "Arrays for 
biomonitoring environmental and reproductive toxicology". Platform Presentation at Macroresults 
Through Microarrays 3 - Advancing Drug Development, April 29th-May 1st, 2002, Boston, MA, 
USA. 

(3) John C. Rockett, Sigmund Degitz, Suzanne E. Fenton, Leroy Folmar, Michael J. Hemmer, Joe 
E Tietge, and David J. Dix. "Use of DNA Arrays in Environmental Toxicology". Platform 
presentation at the 4th Annual Lab-on-a-Chip and Microarrays for Post-Genomic Applications 
meeting, January 14th-16th, 2002, Zurich, Switzerland. 

(2) John C. Rockett. "DNA Arrays". Seminar at EPA Molecular Biology Course, April 8th, 1999, 
USEPA, RTP, NC, USA. 

(1) John C. Rockett. "Contract Services for Array Applications". Seminar at the Triangle Array 
Users Group, May 1st, 1999, CIIT, RTP, NC, USA. 



(9) Other Poster and Oral Presentations 

(23) John C. Rockett, Wenjun Bao, Chad R. Blystone, Amber K. Goetz, Rachel N. Murrell, 
Hongzu Ren, Judith E. Schmid, Jessica Stapelfeldt, Lillian F. Strader, Kary E. Thompson and David 
J. Dix. Genomic Analysis of Surrogate Tissues for Assessing Environmental Exposures and Future 
Disease States. ILSI-HESI meeting: Toxicogenomics in Risk Assessment - Assessing the Utility, 
Challenges, and Next Steps. June 5th-6th, 2003, Fairfax, VA, USA. 

(22) John C. Rockett, Wenjun Bao, Chad R. Blystone, Amber K. Goetz, Rachel N. Murrell, 
Hongzu Ren, Judith E. Schmid, Jessica Stapelfeldt, Lillian F. Strader, Kary E. Thompson and David 
J. Dix. Genomic Analysis of Surrogate Tissues for Assessing Environmental Exposures and Future 
Disease States. EPA Science Forum, May 5th-7th, 2003, Washington, D.C., USA. 

(21) Germaine Buck, Courtney Johnson, Joseph Stanford, Anne Sweeney, Laura Schieve, John 
Rockett, Sherry Selevan and Steve Schrader. Prospective Pregnancy Study Designs for Assessing 
Reproductive and Developmental Toxicants. American Epidemiology Society Meeting, March 
27-28, 2003, Atlanta, GA, USA. 

(20) John C. Rockett, Chad R. Blystone, Amber K. Goetz, Rachel N. Murrell, Hongzu Ren, Judith 
E. Schmid, Jessica Stapelfeldt, Lillian F. Strader, Kary E. Thompson, Doug B. Tully, Paul Zigas 
and David J. Dix. Genomic Analysis of Surrogate Tissues for Assessing Environmental Exposures 
and Future Disease States. National Children's Study Assembly Meeting, December 16-18, 
Baltimore, MD, USA. 

(19) John Rockett. The Use of Gene Expression Profiling to Detect Early Biomarkers of Adverse 
Effects Prior to Clinical Manifestation. National Children's Study: Meeting of EPA Project Leaders 
[illegible] Development Projects. November 20th, 2002, USEPA, RTP, NC, USA. (Oral 
Presentation) 

(18) GC Ostermeier, RJ Goodrich, K Thompson, J Rockett, MP Diamond, K Collins, NICHD 
Reproductive Medicine Network, DJ Dix, D Miller and SA Krawetz. Defining the spermatozoa 
RNA population in normal fertile men. American Society of Reproductive Medicine, October 12-17, 
2002, Seattle, WA, USA. 

(17) G Charles Ostermeier, Robert J. Goodrich, Kary Thompson, John Rockett, Michael P 
Diamond, Karen Collins, NICHD Reproductive Medicine Network, David J. Dix, David Miller and 
Stephen A Krawetz. RNAs isolated from ejaculate spermatozoa provide a noninvasive means to 
investigate testicular gene expression. Gordon Conference on Mammalian Gametogenesis & 
Embryogenesis, June 30th-July 5th, Connecticut College, New London, CT, USA. 

(16) David Dix, John Rockett, Judith Schmid, Lillian Strader, Douglas Tully. Genomic analysis of 
[illegible]. USEPA/NHEERL/RTD Peer Review, October 22nd, 2001, RTP, NC, USA. 

(15) David Dix, John Rockett, Judith Schmid, Douglas Tully. Monitoring human reproductive 
health and development through gene expression profiling. USEPA/NHEERL/RTD Peer Review, 
October 22nd, 2001, RTP, NC, USA. 

(14) Patrizio P, N Hecht, J Rockett, J Schmid and D Dix (2001). DNA microarrays to study gene 
expression profiles in testis of fertile and infertile men. 57th Annual Meeting of the American 
Society for Reproductive Medicine, October 20th-25th, 2001, Orlando, FL, USA. 

(13) Jimmy L Spearow, Dale Morris, Uland Wong, Rashid Altafi, Saeed Eteiwi, Mark Stanford, 
Trevor Stearns, Lorena Orozio, Angela Chen, John Rockett, Douglas Tully, David Dix and 
Marylynn Barkley. Genetic Variation In Susceptibility To The Disruption Of Testicular 
Development And Gene Expression By Pubertal Exposure To Estrogenic Agents. Third Annual 
University of California at Davis Conference for Environmental Health S[illegible] 
Developing Systems and Advances in Therapeutic Approaches, August 27, 2001, UC Davis, CA, 
USA. 

(12) Tarka DK, Klinefelter GR, Rockett JC, Suarez JD, Roberts NL and Rogers JM (2001). Effect 
of gestational exposure to ethane dimethane sulfonate (EDS), bromochloroacetic acid (BCA) and 
molinate on reproductive function in CD-1 male mice. North Carolina Society of Toxicology Winter 
Meeting, March 3rd, 2001, NIEHS, RTP, NC, USA. 

(11) David Dix, John Rockett, Leroy Folmar, Michael Hemmer, Sigmund Degitz, and Joseph 
Tietge (2001). Biomonitoring the Toxicogenomic Response to Endocrine Disrupting Chemicals in 
Humans, Laboratory Species and Wildlife. U.S.-Japan International Workshop for Endocrine 
Disrupting Chemicals, February 28th-March 3rd, 2001, Tsukuba, Japan. 

(10) John C. Rockett, Faye L. Mapp, J. Brian Garges, J. Christopher Luft, Chisato Mori and David 
J. Dix (2001). The effects of hyperthermia on spermatogenesis, apoptosis, gene 
expression and fertility in adult male mice. Triangle Consortium for Reproductive Biology Annual 
Meeting, January 27th, 2001, RTP, NC, USA. 

(9) Gangolli E, Dix DJ, Garges JB, Rockett JC and Idzerda RL (2000). Testosterone Regulation 
of Sertoli Cell Genes. 11th International Congress of Endocrinology, October 29th-November 2nd, 
2000, Sydney, Australia. 

(8) J Rockett, J Luft, J Garges, M Ricci, P Patrizio, N Hecht and D Dix (2000). Using DNA 
microarrays to characterise gene expression in testes of fertile and infertile humans and mice. 
Functional Genomics & Microarray Data Mining, August 3rd-4th, 2000, Durham, NC, USA. 

(7) Rockett JC, S Ricci, P Patrizio, NB Hecht, JB Garges and DJ Dix (2000). Gene Expression in 
the Mammalian Testis. 5th NHEERL Symposium, June 6th-8th, 2000, RTP, NC, USA. 

(6) J Luft, J B Garges, J Rockett and D Dix (2000). Male reproductive toxicity of 
bromochloroacetic acid in mice. 2000 NIEHS/NTA Biomedical Science and Career Fair, April 28, 
2000, RTP, NC, USA. 

(5) Rockett JC, S Ricci, P Patrizio, NB Hecht, JB Garges and DJ Dix (2000). Gene Expression in 
the Mammalian Testis. Molecular Toxicology, Toxicogenomics and Associated Bioinformatics 
Applied to Drug Discovery meeting, January 11th-15th, 2000, Santa Fe, NM, USA. 

(4) JC Rockett and DJ Dix (1999). Development of DNA arrays for the analysis of testis-expressed 
genes in humans and mice. The 8th Annual National Health and Environmental Effects Research 
Laboratory Open House, November 2nd-3rd, 1999, RTP, NC, USA. 

(3) JC Rockett, DJ Esdaile and GG Gibson (1997). Molecular profiling of non-genotoxic 
carcinogenesis using differential display reverse transcription polymerase chain reaction 
(ddRT-PCR). The British Toxicology Society Annual Meeting, April 19th-22nd, 1998, University of 
Surrey, Guildford, Surrey, England. 

(2) JC Rockett, DJ Esdaile and GG Gibson (1997). Molecular profiling of non-genotoxic 
carcinogenesis using differential display reverse transcription polymerase chain reaction 
(ddRT-PCR). Poster presentation at Symposium on Drug Metabolism: Towards the next Millennium. 
August 26th-28th, 1997, London King's College, London, England. 

(1) J Rockett, S Darnton, J Crocker, H Matthews and A Morris: Major Histocompatibility Complex 
(MHC) class I and II and Intercellular Adhesion Molecule (ICAM)-1 expression in oesophageal 
carcinoma. Oral presentation at The 6th World Congress of the International Society for Diseases of 
the Esophagus, August 23rd-26th, 1995, Milan, Italy. 

Quantitative Monitoring of Gene Expression 
Patterns with a Complementary DNA Microarray 

Mark Schena,* Dari Shalon,*† Ronald W. Davis, 
Patrick O. Brown† 

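A high-capacity system was developed to monitor the expression of many genes in 
parallel. Microarrays prepared by high-speed robotic printing of complementary DNAs 
on glass were used for quantitative expression measurements of the corresponding genes. 
Because of the small format and high density of the arrays, hybridization volumes of 2 
microliters could be used that enabled detection of rare transcripts in probe mixtures 
derived from 2 micrograms of total cellular messenger RNA. Differential expression 
measurements of 45 Arabidopsis genes were made by means of simultaneous, two-color 
fluorescence hybridization. 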



The temporal, developmental, topographical, histological, and physiological patterns 
in which a gene is expressed provide clues to its biological role. The large and expanding 
database of complementary DNA (cDNA) sequences from many organisms (1) presents 
the opportunity of defining these patterns at the level of the whole genome. 

For these studies, we used the small flowering plant Arabidopsis thaliana as a model 
organism. Arabidopsis possesses many advantages for gene expression analysis, including 
the fact that it has the smallest genome of any higher eukaryote examined to date (2). 
Forty-five cloned Arabidopsis cDNAs (Table 1), including 14 complete sequences and 31 
expressed sequence tags (ESTs), were used as gene-specific targets. We obtained the ESTs 
by selecting cDNA clones at random from an Arabidopsis cDNA library. Sequence analysis 
revealed that 28 of the 31 ESTs matched sequences in the database (Table 1). Three 
additional cDNAs from other organisms served as controls in the experiments. 

The 48 cDNAs, averaging ~1.0 kb, were amplified with the polymerase chain reaction 
(PCR) and deposited into individual wells of a 96-well microtiter plate. Each sample was 
duplicated in two adjacent wells to allow the reproducibility of the arraying and 
hybridization process to be tested. Samples from the microtiter plate were printed onto 
glass microscope slides in an area measuring 3.5 mm by 5.5 mm with the use of a 
high-speed arraying machine (3). The arrays were processed by chemical and heat 
treatment to attach the DNA sequences to the glass surface and denature them (3). Three 
arrays, printed in a single lot, were used for the experiments here. A single microtiter 
plate of PCR products provides sufficient material to print at least 500 arrays. 
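
The duplicate-well arrangement described above can be sketched in a few lines of Python. This is only an illustration: the function name `duplicate_well_layout` and the row-by-row filling order are assumptions, chosen so that the output resembles the position labels of Table 1 (a1,2 through h11,12); the source does not say how the wells were actually filled.

```python
import string

def duplicate_well_layout(n_samples=48, n_rows=8, n_cols=12):
    """Assign each sample to two adjacent wells of a microtiter plate.

    Hypothetical helper: samples are placed row by row (an assumption),
    each one occupying a pair of neighbouring columns in the same row.
    """
    if n_samples * 2 > n_rows * n_cols:
        raise ValueError("not enough wells for duplicate spotting")
    pairs_per_row = n_cols // 2
    layout = []
    for i in range(n_samples):
        row = string.ascii_lowercase[i // pairs_per_row]
        first_col = 2 * (i % pairs_per_row) + 1
        layout.append((row, first_col, first_col + 1))
    return layout

# 48 samples in duplicate fill all 96 wells of an 8 x 12 plate.
layout = duplicate_well_layout()
```

Keeping the two replicates of a sample side by side makes it easy to compare their signals directly when checking reproducibility.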

Fluorescent probes were prepared from total Arabidopsis mRNA (4) by a single round of 
reverse transcription (5). The Arabidopsis mRNA was supplemented with human 
acetylcholine receptor (AChR) mRNA at a dilution of 1:10,000 (w/w) before cDNA 
synthesis, to provide an internal standard for calibration (5). The resulting fluorescently 
labeled cDNA mixture was hybridized to an array at high stringency (6) and scanned 
with a laser (3). A high-sensitivity scan gave signals that saturated the detector at nearly 
all of the Arabidopsis target sites (Fig. 1A). Calibration relative to the AChR mRNA 
standard (Fig. 1A) established a sensitivity limit of ~1:50,000. No detectable hybridization 
was observed to either the rat glucocorticoid receptor (Fig. 1A) or the yeast TRP4 
(Fig. 1A) targets even at the highest scanning sensitivity. A moderate-sensitivity scan 
of the same array allowed linear detection of the more abundant transcripts (Fig. 1B). 
Quantitation of both scans revealed a range of expression levels spanning three orders of 
magnitude for the 45 genes tested (Table 2). RNA blots (7) for several genes (Fig. 2) 
corroborated the expression levels measured with the microarray to within a factor of 5 
(Table 2). 
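
Calibration against a spiked-in standard of known dilution can be illustrated with a short Python sketch. It assumes a linear, unsaturated detector response; the function name `relative_abundance` and the signal values are hypothetical and are not taken from the paper.

```python
def relative_abundance(signal, standard_signal, standard_fraction=1 / 10_000):
    """Estimate a transcript's mass fraction from its hybridization signal.

    Assumes signal is proportional to abundance (linear, unsaturated scan)
    and that the standard was added at a known fraction of total mRNA.
    """
    if standard_signal <= 0:
        raise ValueError("standard signal must be positive")
    return (signal / standard_signal) * standard_fraction

# Hypothetical intensities: a spot one fifth as bright as a standard spiked
# in at 1:10,000 corresponds to a fraction of roughly 1:50,000.
frac = relative_abundance(signal=500.0, standard_signal=2500.0)
```

The estimate only holds within the linear range of the scan, which is why a saturated high-sensitivity scan cannot be used for the abundant transcripts.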

Differential gene expression was investigated with a simultaneous, two-color 
hybridization scheme, which served to minimize experimental variation inherent in the 
comparison of independent hybridizations. Fluorescent probes were prepared from two 
mRNA sources with the use of reverse transcriptase in the presence of fluorescein- and 
lissamine-labeled nucleotide analogs, respectively (5). The two probes were then mixed 
together in equal proportions, hybridized to a single array, and scanned separately for 
fluorescein and lissamine emission after independent excitation of the two 
fluorophores (3). 

To test whether overexpression of a single gene could be detected in a pool of total 
Arabidopsis mRNA, we used a microarray to analyze a transgenic line overexpressing the 
single transcription factor HAT4 (8). Fluorescent probes representing mRNA from 
wild-type and HAT4-transgenic plants were labeled with fluorescein and lissamine, 
respectively; the two probes were then mixed and hybridized to a single array. An intense 
hybridization signal was observed at the position of the HAT4 cDNA in the 
lissamine-specific scan (Fig. 1D), but not in the fluorescein-specific scan of the same array 
(Fig. 1C). Calibration with AChR mRNA added to the fluorescein and lissamine 
cDNA synthesis reactions at dilutions of 1:10,000 (Fig. 1C) and 1:100 (Fig. 1D), 
respectively, revealed a 50-fold elevation of HAT4 mRNA in the transgenic line relative 
to its abundance in wild-type plants (Table 2). This magnitude of HAT4 overexpression 
matched that inferred from the Northern (RNA) analysis within a factor of 
2 (Fig. 2 and Table 2). Expression of all the other genes monitored on the array differed 
by less than a factor of 5 between HAT4-transgenic and wild-type plants (Fig. 1, C 






Fig. 1. Gene expression monitored with the use of cDNA microarrays. Fluorescent scans represented in 
pseudocolor correspond to hybridization intensities. Color bars were calibrated from the signal obtained 
with the use of known concentrations of human AChR mRNA in independent experiments. Numbers and 
letters on the axes mark the position of each cDNA. (A) High-sensitivity fluorescein scan after hybridization 
with fluorescein-labeled cDNA derived from wild-type plants. (B) Same array as in (A) but scanned at 
moderate sensitivity. (C and D) A single array was probed with a 1:1 mixture of fluorescein-labeled cDNA 
from wild-type plants and lissamine-labeled cDNA from HAT4-transgenic plants. The single array was 
then scanned successively to detect the fluorescein fluorescence corresponding to mRNA from wild-type 
plants (C) and the lissamine fluorescence corresponding to mRNA from HAT4-transgenic plants (D). (E 
and F) A single array was probed with a 1:1 mixture of fluorescein-labeled cDNA from root tissue and 
lissamine-labeled cDNA from leaf tissue. The single array was then scanned successively to detect the 
fluorescein fluorescence corresponding to mRNAs expressed in roots (E) and the lissamine fluorescence 
corresponding to mRNAs expressed in leaves (F). 



[Fig. 2 panels: blots probed for CAB1, HAT4, and ROC1 with 1.0, 0.1, and 0.01 µg of mRNA from 
wild-type and transgenic plants; human AChR calibration with 20, 2.0, and 0.2 ng of mRNA.] 

Fig. 2. Gene expression monitored with RNA (Northern) blot analysis. Designated amounts of 
mRNA from wild-type and HAT4-transgenic plants were spotted onto nylon membranes and 
probed with the cDNAs indicated. Purified human AChR mRNA was used for calibration. 

and D, and Table 2). Hybridization of fluorescein-labeled glucocorticoid receptor 
cDNA (Fig. 1C) and lissamine-labeled TRP4 cDNA (Fig. 1D) verified the presence of the 
negative control targets and the lack of optical cross talk between the two 
fluorophores. 
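
The two-color comparison with a separate standard dilution in each channel can be sketched as follows. This is a minimal illustration assuming a linear detector response in both channels; the function name `fold_change` and every intensity value are invented for the example and are not measurements from the paper.

```python
def fold_change(test_signal, test_std_signal, test_std_fraction,
                ref_signal, ref_std_signal, ref_std_fraction):
    """Ratio of a transcript's abundance in the test vs. the reference sample.

    Each channel is calibrated against its own spiked-in standard, so the
    two standards may be added at different dilutions.
    """
    test_abundance = test_signal / test_std_signal * test_std_fraction
    ref_abundance = ref_signal / ref_std_signal * ref_std_fraction
    return test_abundance / ref_abundance

# Hypothetical intensities: standard at 1:100 in the test (lissamine) channel
# and at 1:10,000 in the reference (fluorescein) channel.
fc = fold_change(test_signal=1000.0, test_std_signal=2000.0, test_std_fraction=1 / 100,
                 ref_signal=100.0, ref_std_signal=100.0, ref_std_fraction=1 / 10_000)
```

With these made-up numbers the test abundance is 0.005 and the reference abundance is 0.0001, giving a ratio of about 50. Calibrating each channel separately means the raw intensities of the two fluorophores never have to be compared directly.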

To explore a more complex alteration in expression patterns, we performed a second 
two-color hybridization experiment with fluorescein- and lissamine-labeled probes 
prepared from root and leaf mRNA, respectively. The scanning sensitivities for the 
two fluorophores were normalized by matching the signals resulting from AChR 
mRNA, which was added to both cDNA synthesis reactions at a dilution of 1:1000 
(Fig. 1, E and F). A comparison of the scans revealed widespread differences in gene 
expression between root and leaf tissue (Fig. 1, E and F). The mRNA from the 
light-regulated CAB1 gene was ~500-fold more abundant in leaf (Fig. 1F) than in root tissue 
(Fig. 1E). The expression of 26 other genes differed between root and leaf tissue by 
more than a factor of 5 (Fig. 1, E and F). 
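
Normalizing the two channels on a shared standard and then flagging genes that differ by more than a factor of 5 might look like the sketch below. The function name `differential_genes`, the gene labels, and the intensities are all hypothetical; the only assumptions carried over from the text are that the same standard (AChR) is present at the same dilution in both channels and that signals are positive and within the linear range.

```python
def differential_genes(root, leaf, standard="AChR", threshold=5.0):
    """Return genes whose leaf/root ratio exceeds the threshold in either direction.

    The leaf channel is rescaled so that the standard gives the same signal
    in both channels before ratios are taken.
    """
    scale = root[standard] / leaf[standard]
    flagged = {}
    for gene, root_signal in root.items():
        if gene == standard:
            continue
        ratio = (leaf[gene] * scale) / root_signal
        if ratio > threshold or ratio < 1.0 / threshold:
            flagged[gene] = ratio
    return flagged

# Hypothetical intensities for three placeholder genes plus the standard.
root = {"AChR": 100.0, "gene_x": 2.0, "gene_y": 400.0, "gene_z": 50.0}
leaf = {"AChR": 200.0, "gene_x": 2000.0, "gene_y": 900.0, "gene_z": 10.0}
flagged = differential_genes(root, leaf)
```

Here `gene_x` is flagged as leaf-enriched and `gene_z` as root-enriched, while `gene_y` stays within the fivefold window.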

The HAT4-transgenic line we examined has elongated hypocotyls, early flowering, 
poor germination, and altered pigmentation (8). Although changes in expression were 



Table 1. Sequences contained on the cDNA microarray. Shown is the position, the known or putative 
function, and the accession number of each cDNA in the microarray (Fig. 1). All but three of the ESTs used 
in this study matched a sequence in the database. NADH, reduced form of nicotinamide adenine 
dinucleotide; ATPase, adenosine triphosphatase; GTP, guanosine triphosphate. 



Position 


cONA 


a1,2 


AChR 


a3, 4 


EST3 


a5,6 


EST6 


a7,8 


AAC1 


ad, 10 


EST12 


all, 12 


EST13 


bl.2 


GAB/ 


b3,4 


EST17 


D5.6 


GA4 


t-7.8 


EST19 


b9, 10 


GBF-1 


oil, 12 


EST23 


C1,2 


EST29 


C3.4 


GBF-2 


C5.6 


EST34 


C7.8 


EST35 


c9, 10 


EST41 


C11. 12 


rGR 


d1.2 


EST42 


d3,4 


EST45 


d5.6 


HAT1 


d7 ( 8 


EST46 


d9, 10 


EST49 


d11, 12 


HAT2 


e1,2 


HAT 4 


e3.4 


EST50 


e5,6 


HATS 


e7,8 


EST51 


e9, 10 


HAT22 


e11, 12 


EST52 


11.2 


EST59 


f3.4 


KNAT1 


f5,6 


EST60 


f7,8 


EST69 


19, 10 


PPH1 


111.12 


EST 70 


gi.2 


EST 75 


g3,4 


EST 78 


95,6 


ROC1 


97,8 


EST82 


g9, 10 


EST83 


gn.12 


EST84 


hl,2 


EST91 


h3,4 


EST96 


h5,6 


SARI 


h7,8 


EST100 


h9, 10 


EST103 


M1, 12 


TRP4 



Function 



Human AChR 
Actin 

NADH dehydrogenase 
Actin 1 
Unknown 
Actin 

Chlorophyll a/b binding 
Phosphogrycerate kinase 
Gtoberellic acid biosynthesis 
Unknown 

G-box binding factor 1 
Elongation factor 
Aldolase 

G-box binding factor 2 
Chbroplast protease 
Unknown 
Cataiase 

Rat glucocorticoid receptor 
Unknown 
ATPase 

Homeobox-leucine zipper 1 
Light harvesting complex 
Unknown 

Homeobox-leucine zipper 2 
Homeobox-leucine zipper 4 
PtosphC4ibulokinase 
Homeobox -leucine zipper 5 
Unknown 

Homeobox-leucine zipper 22 
Oxygen evolving 
Unknown 

Knotted-Kke homeobox 1 
RuBisCO small subunrt 
Translation elongation factor 
Protein phosphatase 1 
Unknown 

Chbroplast protease 
Unknown 
Cyclophilin 
GTP binding 
Unknown 
Unknown 
Unknown 
Unknown 
Synaptobrevin 
Light harvesting complex 
Light harvesting complex 
Yeast tryptophan biosynthesis 

'Proprietary sequence ol Stratagene {La Jdla, CalBomia). 



Accession 
number 



H36236 

227010 

M20016 

U36594t 

T45783 

M85150 

T44490 

L37126 

U36595t 

X63894 

X52256 

T04477 

X63895 

R87034 

T14152 

T22720 

M14053 

U36596t 

J04185 

U09332 

T04063 

T 76267 
U09335 
M90394 

T04344 

M90416 

Z33675 

U09336 ' 

T21749 

Z34607 

U14174 

X14564 

T42799 

U34803 

T44621 

T43698 

R65481 

L14844 

X59152 

233795 

T45278 . 

T13832 

R64816 

M90418 

218205 

X03909 

X04273 



observed for HAT4, large changes in ex- 
pression were not observed for any of the 
other 44 genes we examined. This was 
somewhat surprising, particularly because 
comparative analysis of leaf and root tissue 
identified 27 differentially expressed genes. 
Analysis of an expanded set of genes may be 
required to identify genes whose expression 
changes upon HAT4 overexpression; alter- 
natively, a comparison of mRNA popula- 
tions from specific tissues of wild-type and 
HAT4-transgenic plants may allow identification
of downstream genes.

At the current density of robotic printing, 
it is feasible to scale up the fabrication pro- 
cess to produce arrays containing 20,000 
cDNA targets. At this density, a single array 
would be sufficient to provide gene-specific 
targets encompassing nearly the entire rep- 
ertoire of expressed genes in the Arabidopsis 
genome (2). The availability of 20,274 ESTs 
from Arabidopsis (1, 9) would provide a rich
source of templates for such studies. 

The estimated 100,000 genes in the hu- 
man genome (10) exceeds the number of 
Arabidopsis genes by a factor of 5 (2). This 
modest increase in complexity suggests that 
similar cDNA microarrays, prepared from 
the rapidly growing repertoire of human 
ESTs (1), could be used to determine the
expression patterns of tens of thousands of 
human genes in diverse cell types. Coupling 
an amplification strategy to the reverse 
transcription reaction (11) could make it
feasible to monitor expression even in 
minute tissue samples. A wide variety of 
acute and chronic physiological and patho- 
logical conditions might lead to character- 
istic changes in the patterns of gene expres- 
sion in peripheral blood cells or other easily 
sampled tissues. In concert with cDNA mi- 
croarrays for monitoring complex expres- 
sion patterns, these tissues might therefore 
serve as sensitive in vivo sensors for clinical 
diagnosis. Microarrays of cDNAs could thus 
provide a useful link between human gene 
sequences and clinical medicine. 



Table 2. Gene expression monitoring by microarray
and RNA blot analyses; tg, HAT4-transgenic.
See Table 1 for additional gene information.
Expression levels (w/w) were calibrated with the use
of known amounts of human AChR mRNA. Values
for the microarray were determined from microarray
scans (Fig. 1); values for the RNA blot were
determined from RNA blots (Fig. 2).

               Expression level (w/w)
Gene           Microarray     RNA blot
CAB1           1:48           1:83
CAB1 (tg)      1:120          1:150
HAT4           1:8300         1:6300
HAT4 (tg)      1:150          1:210
ROC1           1:1200         1:1800
ROC1 (tg)      1:260          1:1300

†No match in the database; novel EST. (Footnote to Table 1.)
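
Because the Table 2 entries are weight ratios of the form 1:N, the fold change between transgenic and wild-type plants is the wild-type N divided by the transgenic N. The short Python sketch below works this through for the values as printed in this copy; the dictionary layout and function name are illustrative assumptions, not part of the original analysis.

```python
# Expression levels from Table 2, stored as the N in "1:N" (w/w),
# in the order (microarray, RNA blot). Values are as printed in this copy.
table2 = {
    "CAB1": {"wt": (48, 83), "tg": (120, 150)},
    "HAT4": {"wt": (8300, 6300), "tg": (150, 210)},
    "ROC1": {"wt": (1200, 1800), "tg": (260, 1300)},
}


def fold_change(wt_n, tg_n):
    """Level is 1/N, so (1/tg_n) / (1/wt_n) simplifies to wt_n / tg_n."""
    return wt_n / tg_n


fold_changes = {}
for gene, rows in table2.items():
    fold_changes[gene] = {
        "microarray": fold_change(rows["wt"][0], rows["tg"][0]),
        "rna_blot": fold_change(rows["wt"][1], rows["tg"][1]),
    }

for gene, fc in fold_changes.items():
    print(f"{gene}: microarray {fc['microarray']:.1f}x, RNA blot {fc['rna_blot']:.1f}x")
```

For HAT4 this gives roughly 55-fold by microarray and 30-fold by RNA blot, so the two methods agree on the direction and approximate size of the change.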
Gene Therapy in Peripheral Blood
Lymphocytes and Bone Marrow for
ADA Immunodeficient Patients

Claudio Bordignon,* Luigi D. Notarangelo, Nadia Nobili,
Giuliana Ferrari, Giulia Casorati, Paola Panina, Evelina Mazzolari,
Daniela Maggioni, Claudia Rossi, Paolo Servida,
Alberto G. Ugazio, Fulvio Mavilio

Adenosine deaminase (ADA) deficiency results in severe combined immunodeficiency, 
the first genetic disorder treated by gene therapy. Two different retroviral vectors were 
used to transfer ex vivo the human ADA minigene into bone marrow cells and peripheral 
blood lymphocytes from two patients undergoing exogenous enzyme replacement ther- 
apy. After 2 years of treatment, long-term survival of T and B lymphocytes, marrow cells, 
and granulocytes expressing the transferred ADA gene was demonstrated and resulted 
in normalization of the immune repertoire and restoration of cellular and humoral immunity. 
After discontinuation of treatment, T lymphocytes, derived from transduced peripheral 
blood lymphocytes, were progressively replaced by marrow-derived T cells in both pa- 
tients. These results indicate successful gene transfer into long-lasting progenitor cells, 
producing a functional multilineage progeny. 



Severe combined immunodeficiency asso- 
ciated with inherited deficiency of ADA 
(1) is usually fatal unless affected children
are kept in protective isolation or the im- 
mune system is reconstituted by bone mar- 
row transplantation from a human leuko- 
cyte antigen (HLA)-identical sibling donor 
(2). This is the therapy of choice, although 
it is available only for a minority of patients. 
In recent years, other forms of therapy have 
been developed, including transplants from 
haploidentical donors (3, 4), exogenous enzyme
replacement (5), and somatic-cell
gene therapy (6-9). 

We previously reported a preclinical mod- 
el in which ADA gene transfer and expression 
successfully restored immune functions in human
ADA-deficient (ADA⁻) peripheral
blood lymphocytes (PBLs) in immunodeficient
mice in vivo (10, 11). On the basis of
these preclinical results, the clinical application
of gene therapy for the treatment of
ADA⁻ SCID (severe combined immunodeficiency
disease) patients who previously failed
exogenous enzyme replacement therapy was
approved by our Institutional Ethical Committees
and by the Italian National Committee
for Bioethics (12). In addition to evaluating
the safety and efficacy of the gene therapy
procedure, the aim of the study was to define
the relative role of PBLs and hematopoietic
stem cells in the long-term reconstitution of
immune functions after retroviral vector-mediated
ADA gene transfer. For this purpose,
two structurally identical vectors expressing
the human ADA complementary DNA
(cDNA), distinguishable by the presence of
alternative restriction sites in a nonfunctional
region of the viral long-terminal repeat
(LTR), were used to transduce PBLs and bone
marrow (BM) cells independently. This procedure
allowed identification of the origin of
cell's development or response and should help in the elucidation of specific and
sensitive biomarkers representing, for example, different types of cancer or previous 
exposure to certain classes of chemicals that are enzyme inducers. 

In drug metabolism, many of the xenobiotic-metabolizing enzymes (including 
the well-characterized isoforms of cytochrome P450) are inducible by drugs and 
chemicals in man (Pelkonen et al. 1998), predominantly involving transcriptional 
activation of not only the cognate cytochrome P450 genes, but additional cellular 
proteins which may be crucial to the phenomenon of induction. Accordingly, the
development of methodology to identify and assess the full complement of genes
that are either up- or down-regulated by inducers is crucial in the development of
knowledge to understand the precise molecular mechanisms of enzyme induction
and how this relates to drug action. Similarly, in the field of chemical-induced 
toxicity, it is now becoming increasingly obvious that most adverse reactions to 
drugs and chemicals are the result of multiple gene regulation, some of which are 
causal and some of which are casually-related to the toxicological phenomenon per
se. This observation has led to an upsurge in interest in gene-profiling technologies 
which differentiate between the control and toxin-treated gene pools in target tissues 
and is, therefore, of value in rationalizing the molecular mechanisms of xenobiotic- 
induced toxicity. Knowledge of toxin-dependent gene regulation in target tissues is 
not solely an academic pursuit as much interest has been generated in the 
pharmaceutical industry to harness this technology in the early identification of toxic 
drug candidates, thereby shortening the developmental process and contributing 
substantially to the safety assessment of new drugs. For example, if the gene profile 
in response to say a testicular toxin that has been well-characterized in vivo could be 
determined in the testis, then this profile would be representative of all new drug 
candidates which act via this specific molecular mechanism of toxicity, thereby 
providing a useful and coherent approach to the early detection of such toxicants. 
Whereas it would be informative to know the identity and functionality of all genes 
up/down regulated by such toxicants, this would appear a longer term goal, as the 
majority of human genes have not yet been sequenced, far less their functionality 
determined. However, the current use of gene profiling yields a pattern of gene 
changes for a xenobiotic of unknown toxicity which may be matched to that of well- 
characterized toxins, thus alerting the toxicologist to possible in vivo similarities 
between the unknown and the standard, thereby providing a platform for more 
extensive toxicological examination. Such approaches are beginning to gain 
momentum, in that several biotechnology companies are commercially producing 
'gene chips' or 'gene arrays' that may be interrogated for toxicity assessment of
xenobiotics. These chips consist of hundreds/thousands of genes, some of which are 
degenerate in the sense that not all of the genes are mechanistically-related to any 
one toxicological phenomenon. Whereas these chips are useful in broad-spectrum
screening, they are maturing at a substantial rate, in that gene arrays are now 
becoming more specific, e.g. chips for the identification of changes in growth factor 
families that contribute to the aetiology and development of chemically-induced 
neoplasias. 

Although documenting and explaining these genetic changes presents a 
formidable obstacle to understanding the different mechanisms of development and 
disease progression, the technology is now available to begin attempting this difficult 
challenge. Indeed, several 'differential expression analysis' methods have been 
developed which facilitate the identification of gene products that demonstrate 
altered expression in cells of one population compared to another. These methods
have been used to identify differential gene expression in many situations, including
invading pathogenic microbes (Zhao et al. 1998), in cells responding to extracellular
and intracellular microbial invasion (Duguid and Dinauer 1990, Ragno et al. 1997,
Maldarelli et al. 1998), in chemically treated cells (Syed et al. 1997, Rockett et al.
1999), neoplastic cells (Liang et al. 1992, Chang and Terzaghi-Howe 1998),
activated cells (Gurskaya et al. 1996, Wan et al. 1996), differentiated cells (Hara et
al. 1991, Guimaraes et al. 1995a, b), and different cell types (Davis et al. 1984,
Hedrick et al. 1984, Xhu et al. 1998). Although differential expression analysis
technologies are applicable to a broad range of models, perhaps their most important
advantage is that, in most cases, absolutely no prior knowledge of the specific genes
which are up- or down-regulated is required.

The field of differential expression analysis is a large and complex one, with 
many techniques available to the potential user. These can be categorized into 
several methodological approaches, including : 

(1) Differential screening, 

(2) Subtractive hybridization (SH) (includes methods such as chemical cross-
linking subtraction (CCLS), suppression-PCR subtractive hybridization
(SSH), and representational difference analysis (RDA)),

(3) Differential display (DD),

(4) Restriction endonuclease facilitated analysis (including serial analysis of gene
expression, SAGE, and gene expression fingerprinting, GEF),

(5) Gene expression arrays, and 

(6) Expressed sequence tag (EST) analysis. 

The above approaches have been used successfully to isolate differentially 
expressed genes in different model systems. However, each method has its own
subtle (and sometimes not so subtle) characteristics which incur various advantages 
and disadvantages. Accordingly, it is the purpose of this review to clarify the 
mechanistic principles underlying the main differential expression methods and to 
highlight some of the broader considerations and implications of this very powerful 
and increasingly popular technique. Specifically, we will concentrate on the so- 
called open systems, namely those which do not require any knowledge of gene 
sequences and, therefore, are useful for isolating unknown genes. Two 'closed' 
systems (those utilising previously identified gene sequences), EST analysis and the 
use of DNA arrays, will also be considered briefly for completeness. Whilst 
emphasis will often be placed on suppression PCR subtractive hybridization (SSH,
the approach employed in this laboratory), it is the aim of the authors to highlight,
wherever possible, those areas of common interest to those who use, or intend to use,
differential gene expression analysis.



Differential cDNA library screening (DS) 

Despite the development of multiple technological advances which have recently 
brought the field of gene expression profiling to the forefront of molecular analysis,
recognition of the importance of differential gene expression and characterization of 
differentially expressed genes has existed for many years. One of the original 
approaches used to identify such genes was described 20 years ago by St John and 
Davis (1979). These authors developed a method, termed 'differential plaque filter 
hybridization', which was used to isolate galactose-inducible DNA sequences from
yeast. The theory is simple: a genomic DNA library is prepared from normal,
unstimulated cells of the test organism/tissue [illegible in source] blots are
[probed] with radioactively [labelled] complex cDNA probes prepared from the
control and test cell mRNA populations. Those mRNAs which are differentially
expressed in the treated cell population will show a positive signal only on the
filter probed with cDNA from the treated cells. [Illegible in source] cDNA from
different test conditions can be used to probe multiple blots, thereby enabling the
identification of mRNAs which are only up-regulated under certain conditions. For
example, St John and Davis (1979) screened replica filters with acetate-, glucose-
and galactose-derived probes in order to obtain [illegible in source] by galactose
metabolism. Although groundbreaking in its time, this method is now considered
insensitive and time-consuming, as up to 2 months are required to complete the
identification of genes which are differentially [expressed in the test] population.
In addition, there is no convenient way to check that the procedure has worked
until the whole process has been completed.

Subtractive Hybridization (SH) 

The developing concept of differential gene expression and the success of early
approaches such as that described by St John and Davis (1979) soon gave rise to a
search for more convenient methods of analysis. One of the first to be developed was
SH, numerous variations of which have since been reported (see below). In general,
[illegible in source] from another (driver), followed by separation of the
unhybridized tester fraction (differentially expressed) from the hybridized common
sequences. This step has been achieved physically, chemically and through the use
of selective polymerase chain reaction (PCR) techniques.

Physical separation 

The original subtractive hybridization technology involved the physical separation
of hybridized common species from unique single stranded species. Several methods
[illegible in source] described, including [hydroxyapatite] chromatography
[citation illegible in source], avidin-biotin technology (Duguid and Dinauer 1990)
and oligodT-latex separation (Hara et al. 1991). In the first approach, common
mRNA species are removed by cDNA (from test cells)-mRNA [from control cells]
subtractive hybridization followed by hydroxyapatite chromatography, as
hydroxyapatite [specifically] adsorbs cDNA-RNA hybrids. The unabsorbed cDNA is
then used either for the construction of a cDNA library of differentially expressed
genes (Sargent and Dawid 1983, Schneider et al. 1988) or directly as a probe to
screen a preselected library (Zimmerman et al. 1980, Davis et al. 1984, Hedrick et al.
1984). A schematic diagram of the procedure is shown in figure 1.

[Illegible in source] developed as a means to overcome some of the problems
encountered with the hydroxyapatite procedure. For example, Duguid and Dinauer
(1990) described a method of subtraction utilizing biotin-affinity systems as a means
to remove hybridized common sequences. In this process, both the control and
tester mRNA populations are first converted to cDNA and an adaptor ('oligovector',
[Figure 1 schematic: control (driver) mRNA and tester (test) cDNA (1st strand) → mix
(ratio >35:1) and hybridize → hydroxyapatite chromatography (RNA:DNA hybrids removed) →
unhybridized cDNA (differentially expressed) and mRNA → Sepharose CL6B exclusion
chromatography (small cDNA fragments, <450 bp, removed) → enriched, differentially
expressed cDNA → produce clones, or label directly and probe library.]

Figure 1. The hydroxyapatite method of subtractive hybridization. cDNA derived from the
treated/altered (tester) population is mixed with a large excess of mRNA from the control (driver)
population. Following hybridization, mRNA-cDNA hybrids are removed by hydroxyapatite
chromatography. The only cDNAs which remain are those which are differentially expressed in
the treated/altered population. In order to facilitate the recovery of full length clones, small cDNA
fragments are removed by exclusion chromatography. The remaining cDNAs are then cloned into
a vector for sequencing, or labelled and used directly to probe a library, as described by Sargent
and Dawid (1983).



containing a restriction site) ligated to both sides. Both populations are then 
amplified by PCR, but the driver cDNA population is subsequently digested with 
the adaptor-containing restriction endonuclease. This serves to cleave the oligo- 
vector and reduce the amplification potential of the control population. The digested 
control population is then biotinylated and an excess mixed with tester cDNA. 
Following denaturation and hybridization, the mix is applied to a biocytin column 
(streptavidin may also be used) to remove the control population, including 
heteroduplexes formed by annealing of common sequences from the tester 
population. The procedure is repeated several times following the addition of fresh 
[Figure 2 schematic: control (driver) mRNA and test (tester) mRNA → tester-specific mRNA
retrieved after 4 rounds of hybridization → sequence inserts and/or carry out other
downstream applications.]

Figure 2. The use of oligo(dT30)-latex to perform subtractive hybridization. mRNA extracted from the
control (driver) population is converted to anchored cDNA using polydT oligonucleotides
attached to latex beads. mRNA from the treated/altered (tester) population is repeatedly
hybridized against an excess of the anchored driver cDNA. The final population of mRNA is
tester specific and can be converted into cDNA for cloning and other downstream applications, as
described by Hara et al. (1991).
control cDNA. In order to further enrich those species differentially expressed in
the tester cDNA, the subtracted tester population is amplified by PCR following 
every second subtraction cycle. After six cycles of subtraction (three reamplification 
steps) the reaction mix is ligated into a vector for further analysis. 

In a slightly different approach, Hara et al. (1991) utilized a method whereby
oligo(dT30) primers attached to a latex substrate are used to first capture mRNA
extracted from the control population. Following 1st strand cDNA synthesis, the
RNA strand of the heteroduplexes is removed by heat denaturation and centrifugation
(the cDNA-oligotex-dT30 forms a pellet and the supernatant is removed).
A quantity of tester mRNA is then repeatedly hybridized to the immobilized control
(driver) cDNA (which is present in 20-fold excess). After several rounds of
hybridization the only mRNA molecules left in the tester mRNA population are
those which are not found in the driver cDNA-oligotex-dT30 population. These
tester-specific mRNA species are then converted to cDNA and, following the
addition of adaptor sequences, amplified by PCR. The PCR products are then
ligated into a vector for further analysis using restriction sites incorporated into the
PCR primers. A schematic illustration of this subtraction process is shown in figure 2.

However, all these methods utilising physical separation have been described as 
inefficient due to the requirement for large starting amounts of mRNA, significant 
loss of material during the separation process and a need for several rounds of
hybridization. Hence, new methods of differential expression analysis have recently 
been designed to eliminate these problems. 

Chemical Cross-Linking Subtraction (CCLS) 

In this technique, originally described by Hampson et al. (1992), driver mRNA
is mixed with tester cDNA (1st strand only) in a ratio of > 20:1. The common
sequences form cDNA:mRNA hybrids, leaving the tester specific species as single
stranded cDNA. Instead of physically separating these hybrids, they are inactivated
chemically using 2,5-diaziridinyl-1,4-benzoquinone (DZQ). Labelled probes are
then synthesized from the remaining single stranded cDNA species (unreacted 
mRNA species remaining from the driver are not converted into probe material due 
to specificity of Sequenase T7 DNA polymerase used to make the probe) and used 
to screen a cDNA library made from the tester cell population. A schematic diagram 
of the system is shown in figure 3. 

It has been shown that the differentially expressed sequences can be enriched at 
least 300-fold with one round of subtraction (Hampson et al. 1992) and that the 
technique should allow isolation of cDNAs derived from transcripts that are present
at less than 50 copies per cell. This equates to genes at the low end of intermediate 
abundance (see table 1). The main advantages of the CCLS approach are that it is 
rapid, technically simple and also produces fewer false positives than other 
differential expression analysis methods. However, like the physical separation 
protocols, a major drawback with CCLS is the large amount of starting material 
required (at least 10 µg RNA). Consequently, the technique has recently been
refined so that a renewable source of RNA can be generated. The degenerate random 
oligonucleotide primed (DROP) adaptation (Hampson et al. 1996, Hampson and 
Hampson 1997) uses random hexanucleotide sequences to prime solid phase- 
synthesized cDNA. Since each primer includes a T7 polymerase promoter sequence
Figure 3. Chemical cross-linking subtraction. Excess driver mRNA is mixed with 1st strand tester
cDNA. The common sequences form mRNA:cDNA hybrids which are cross-linked with
2,5-diaziridinyl-1,4-benzoquinone (DZQ) and the remaining cDNA sequences are differentially
expressed in the tester population. Probes are made from these sequences using Sequenase 2.0
DNA polymerase, which lacks reverse transcriptase activity and, therefore, does not react with the
remaining mRNA molecules from the driver. The labelled probes are then used to screen a cDNA
library for clones of differentially expressed sequences. Adapted from Walter et al. (1996) with
permission.

Table 1. The abundance of mRNA species and classes in a typical mammalian cell.

                Copies of       No. of mRNA     Mean % of       Mean mass (ng) of
                each            species in      each species    each species/µg
Class           species/cell    class           in class        total RNA

Abundant        12000           4               3.3             1.65
Intermediate    300             500             0.08            0.04
Rare            15              11000           0.004           0.002

Modified from Bertioli et al. (1995).
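
The per-species percentages in the abundance table follow from the copy numbers and species counts alone, which the short Python sketch below makes explicit. The mass column is reproduced only under an added assumption that is not stated in the source: that mRNA makes up about 5% of total RNA, i.e. roughly 50 ng of mRNA per µg of total RNA.

```python
# (copies of each species per cell, number of species in the class)
classes = {
    "abundant": (12000, 4),
    "intermediate": (300, 500),
    "rare": (15, 11000),
}

# Total mRNA molecules per cell across all classes.
total_copies = sum(copies * n_species for copies, n_species in classes.values())

# Share of the mRNA pool contributed by ONE species in each class, in percent.
percent = {name: 100.0 * copies / total_copies for name, (copies, _) in classes.items()}

# ASSUMPTION (not from the source): mRNA is ~5% of total RNA, so 1 µg total RNA
# contains about 50 ng of mRNA.
MRNA_NG_PER_UG_TOTAL_RNA = 50.0
mass_ng = {name: p / 100.0 * MRNA_NG_PER_UG_TOTAL_RNA for name, p in percent.items()}

for name in classes:
    print(f"{name:<13} {percent[name]:.4f} %  {mass_ng[name]:.4f} ng/µg total RNA")
```

Under that assumption a single rare transcript species contributes only about 2 pg per µg of total RNA, which is why methods that need large amounts of starting material struggle with rare messages.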



Differential gene expression 663 

at the 5' end, the final pool of random cDNA fragments is a PCR-renewable cDNA
population which is representative of the expressed gene pool and can be used to
synthesize sense RNA for use as driver material. Furthermore, if the final pool of
random cDNA fragments is reamplified using biotinylated T7 primer and random 
hexamer, the product can be captured with streptavidin beads and the antisense 
strand eluted for use as tester. Since both target and driver can be generated from 
the same DROP product, subtraction can be performed in both directions (i.e. for 
up- and down-regulated species) between two different DROP products. 

Representational Difference Analysis (RDA)

RDA of cDNA (Hubank and Schatz 1994) is an extension of the technique 
originally applied to genomic DNA as a means of identifying differences between 
two complex genomes (Lisitsyn et al. 1993). It is a process of subtraction and 
amplification involving subtractive hybridization of the tester in the presence of 
excess driver. Sequences in the tester that have homologues in the driver are 
rendered unamplifiable, whereas those genes expressed only in the tester retain the 
ability to be amplified by PCR. The procedure is shown schematically in figure 4. 

In essence, the driver and tester mRNA populations are first converted to cDNA 
and amplified by PCR following the ligation of an adaptor. The adaptors are then 
removed from both populations and a new (different) adaptor ligated to the 
amplified tester population only. Driver and tester populations are next melted and 
hybridized together in a ratio of 100:1. Following hybridization, only tester:tester
homohybrids have 5' adaptors at each end of the DNA duplex and can, thus, be filled
in at both 3' ends. Hence, only these molecules are amplified exponentially during
the subsequent PCR step. Although tester:driver heterohybrids are present, they
only amplify in a linear fashion, since the strand derived from the driver has no
adaptor to which the primer can bind. Driver:driver homohybrids have no
adaptors and, therefore, are not amplified. Single stranded molecules are digested
with mung bean nuclease before a further PCR-enrichment of the tester:tester
homohybrids. The adaptors on the amplified tester population are then replaced and
the whole process repeated a further two or three times using an increasing excess of
driver (Hubank and Schatz used a tester:driver ratio of 1:400, 1:80000 and
1:800000 for the second, third and fourth hybridizations, respectively). Different
adaptors are ligated to the tester between successive rounds of hybridization and 
amplification to prevent the accumulation of PCR products that might interfere with 
subsequent amplifications. The final display is a series of differentially expressed 
gene products easily observable on an ethidium bromide gel. 
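
The selective amplification at the heart of RDA can be illustrated with an idealized Python sketch. It assumes perfect PCR efficiency and one starting duplex of each hybrid type; the function name and the 20-cycle figure are illustrative choices, not parameters from Hubank and Schatz.

```python
def rda_yield(cycles):
    """Idealized copy number after `cycles` rounds of PCR for each hybrid type.

    tester:tester  both strands carry the adaptor, so copies double each cycle
    tester:driver  only one strand can be primed, so one copy is added per cycle
    driver:driver  no adaptor on either strand, so nothing is amplified
    """
    return {
        "tester:tester": 2 ** cycles,
        "tester:driver": 1 + cycles,
        "driver:driver": 1,
    }


yields = rda_yield(20)
for hybrid, copies in yields.items():
    print(f"{hybrid:<14} {copies:>9}")
```

After 20 ideal cycles the tester:tester product outnumbers the linearly amplified heterohybrid by almost 50,000 to 1, which is why only tester-specific sequences remain visible on the gel.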

The main advantages of RDA are that it offers a reproducible and sensitive 
approach to the analysis of differentially expressed genes. Hubank and Schatz (1994) 
reported that they were able to isolate genes that were differentially expressed in 
substantially less than 1 % of the cells from which the tester is derived. Perhaps the 
main drawback is that multiple rounds of ligation, hybridization, amplification and
digestion are required. The procedure is, therefore, lengthier than many other 
differential display approaches and provides more opportunity for operator-induced 
error to occur. Although the generation of false positives has been noted, this has 
been solved to some degree by O'Neill and Sinclair (1997) through the use of HPLC-
purified adaptors. These are free of the truncated adaptors which appear to be a 
major source of the false positive bands. A very similar technique to RDA, termed 
linker capture subtraction (LCS) was described by Yang and Sytowski (1996). 
Figure 4. The representational difference analysis (RDA) technique. Driver and tester cDNA are
digested with a 4-cutter restriction enzyme such as DpnII. The 1st set of 12/24 adaptor strands
(oligonucleotides) are ligated to each other and the digested cDNA products. The 12mer is
subsequently melted away and the 3' ends filled in using Taq DNA polymerase. Each cDNA
population is then amplified using PCR, following which the 1st set of adaptors is removed with
DpnII. A second set of 12/24 adaptor strands is then added to the amplified tester cDNA
population, after which the tester is hybridized against a large excess of driver. The 12mer
adaptors are melted and the 3' ends filled in as before. PCR is carried out with primers identical
to the new 24mer adaptor. Thus, the only hybridization products which are exponentially
amplified are those which are tester:tester combinations. Following PCR, ssDNA products are
removed with mung bean nuclease, leaving the 'first difference product'. This is digested and a
third set of 12/24 adaptors added before repeating the subtraction process from the hybridization
stage. The process is repeated to the 3rd or 4th difference product, as described by Lisitsyn et al.
(1993) and Hubank and Schatz (1994).
Suppression PCR Subtractive Hybridization (SSH) 

The most recent adaptation of the SH approach to differential expression 
analysis was first described by Diatchenko et al. (1996) and Gurskaya et al. (1996). 
They reported that a 1000-5000 fold enrichment of rare cDNAs (equivalent to 
isolating mRNAs present at only a few copies per cell) can be obtained without the 
need for multiple hybridizations/subtractions. Instead of physical or chemical 
removal of the common sequences, a PCR-based suppression system is used (see 
figure 5). 

In SSH, excess driver cDNA is added to two portions of the tester cDNA which 
have been ligated with different adaptors. A first round of hybridization serves to 
enrich differentially expressed genes and equalize rare and abundant messages. 
Equalization occurs since reannealing is more rapid for abundant molecules than for 
rarer molecules due to the second order kinetics of hybridization (James and Higgins 
1985). The two primary hybridization mixes are then mixed together in the presence 
of excess driver and allowed to hybridize further. This step permits the annealing of 
single stranded complementary sequences which did not hybridize in the primary 
hybridization, and in doing so generates templates for PCR amplification. Although 
there are several possible combinations of the single stranded molecules present in 
the secondary hybridization mix, only one particular combination (differentially 
expressed in the tester cDNA composed of complementary strands having different 
adaptors) can amplify exponentially. 
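
The equalization argument can be illustrated with the textbook expression for ideal second-order reannealing, in which the fraction of a species still single-stranded after time t is 1/(1 + k·C0·t). The sketch below is only an illustration: the rate constant, time and starting concentrations are arbitrary assumed values, not figures from the studies cited, and the contribution of driver is ignored. 

```python
def single_stranded_fraction(c0, k, t):
    """Fraction of a species still single-stranded after time t.

    Ideal second-order reannealing: C/C0 = 1 / (1 + k * C0 * t).
    """
    return 1.0 / (1.0 + k * c0 * t)


K = 1.0    # assumed rate constant (arbitrary units)
T = 10.0   # assumed hybridization time (arbitrary units)
start = {"abundant": 100.0, "rare": 1.0}  # assumed starting concentrations

# Concentration of each species that remains single-stranded after time T.
remaining = {
    name: c0 * single_stranded_fraction(c0, K, T) for name, c0 in start.items()
}

ratio_before = start["abundant"] / start["rare"]
ratio_after = remaining["abundant"] / remaining["rare"]
print(f"abundant:rare before = {ratio_before:.1f}, after = {ratio_after:.2f}")
```

Under these assumed values the abundant species starts at a 100-fold excess, yet the two single-stranded pools end up within about 10% of each other, which is the sense in which the primary hybridization equalizes rare and abundant messages. 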

Having obtained the final differential display, two options are available if cloning 
of cDNAs is desired. One is to transform the whole of the final PCR reaction into 
competent cells. Transformed colonies can then be isolated and their inserts 
characterized by sequencing, restriction analysis or PCR. Alternatively, the final 
PCR products can be resolved on a gel and the individual bands excised, reamplified 
and cloned. The first approach is technically simpler and less time consuming. 
However, ligation/transformation reactions are known to be biased towards the 
cloning of smaller molecules, and so the final population of clones will probably not 
contain a representative selection of the larger products. In addition, although 
equalization theoretically occurs, observations in this laboratory suggest that this is 
by no means perfectly accomplished. Consequently, some gene species are present 
in a higher number than others and this will be represented in the final population 
of clones. Thus, in order to obtain a substantial proportion of those gene species that 
actually demonstrate differential expression in the tester population, the number of 
clones that will have to be screened after this step may be substantial. The second 
approach is initially more time consuming and technically demanding. However, it 
would appear to offer better prospects for cloning larger and low abundance gel 
products. In addition, one can incorporate a screening step that differentiates 
different products of different sequences but of the same size (HA-staining, see 
later). In this way, a good idea of the final number of clones to be isolated and 
identified can be achieved. 

An alternative (or even complementary) approach is to use the final differential 
display reaction to screen a cDNA library to isolate full length clones for further 
characterization, or a DNA array (see later) to quickly identify known genes. SSH 
has been used in this laboratory to begin characterization of the short-term gene 
expression profiles of enzyme-inducers such as phenobarbital (Rockett et al. 1997) 
and Wy-14,643 (Rockett et al. unpublished observations). The isolation of 
differentially expressed genes in this manner enables the construction of a fingerprint 

Figure 5. PCR-select cDNA subtraction. In the primary hybridization, an excess of driver cDNA is 
added to each tester cDNA population. The samples are heat denatured and allowed to hybridize 
for between 3 and 8 h. This serves two purposes: (1) to equalize rare and abundant molecules; and 
(2) to enrich for differentially expressed sequences (cDNAs that are not differentially expressed 
form type c molecules with the driver). In the secondary hybridization, the two primary 
hybridizations are mixed together without denaturing. Fresh denatured driver can also be added 
at this point to allow further enrichment of differentially expressed sequences. Type e molecules 
are formed in this secondary hybridization which are subsequently amplified using two rounds of 
PCR. The final products can be visualized on an agarose gel, labelled directly or cloned into a 
vector for downstream manipulation. As described by Diatchenko et al. (1996) and Gurskaya 
et al. (1996), with permission. 

Figure 6. Flow diagram showing method used in this laboratory to isolate and identify clones of genes 
which are differentially expressed in rat liver following short term exposure to the enzyme 
inducers, phenobarbital and Wy-14,643. 

of expressed genes which are unique to each compound and time/dose point. Such 
information could be useful in short-term characterization of the toxic potential of 
new compounds by comparing the gene-expression profiles they elicit with those 
produced by known inducers. Figure 6 shows a flow diagram of the method used to 
isolate, verify and clone differentially expressed genes, and figure 7 shows expression 
profiles obtained from a typical SSH experiment. Subsequent sub-cloning of the 
individual bands, sequencing and gene data base interrogation reveals many genes 
which are either up- or down-regulated by phenobarbital in the rat (tables 2 and 3). 

One of the advantages in using the SSH approach is that no prior knowledge is 
required of which specific genes are up/down-regulated subsequent to xenobiotic 

Figure 7. SSH display patterns obtained from rat liver following 3-day treatment with Wy-14,643 or 
phenobarbital. mRNA extracted from control and treated livers was used to generate the 
differential displays using the PCR-Select cDNA subtraction kit (Clontech). Lanes: 1, 1 kb 
ladder; 2, genes upregulated following Wy-14,643 treatment; 3, genes downregulated following 
Wy-14,643 treatment; 4, genes upregulated following phenobarbital treatment; 5, genes 
downregulated following phenobarbital treatment; 6, 1 kb ladder. Reproduced from Rockett et 
al. (1997), with permission. 

exposure, and an almost complete complement of genes are obtained. For example, 
the peroxisome proliferator and non-genotoxic hepatocarcinogen Wy-14,643 
up-regulates at least 28 genes and down-regulates at least 15 in the rat (a sensitive 
species) and produces 48 up- and 37 down-regulated genes in the guinea pig, a 
resistant species (Rockett, Swales, Esda and Gibson, unpublished observations). 
One of these genes, CD81, was up-regulated in the rat and down-regulated in the 
guinea pig following Wy-14,643 treatment. CD81 (alternatively named TAPA-1) is 
a widely expressed cell surface protein which is involved in a large number of cellular 
processes including adhesion, activation, proliferation and differentiation (Levy et 
al. 1998). Since all of these functions are altered to some extent in the phenomena 
of hepatomegaly and non-genotoxic hepatocarcinogenesis, it is intriguing, and 
probably mechanistically-relevant, that CD81 expression is differentially regulated 
in a resistant and susceptible species. However, the down-side of this approach is 
that the majority of genes can be sequenced and matched to database sequences, but 
the latter are predominantly expressed sequence tags or genes of completely 
unknown function, thus partially obscuring a realistic overall assessment of the 
critical genes of genuine biological interest. Notwithstanding the lack of complete 
functional identification of altered gene expression, such gene profiling studies 
essentially provide a 'molecular fingerprint' in response to xenobiotic challenge, 
thereby serving as a mechanistically-relevant platform for further detailed 
investigations. 

Differential Display (DD) 

Originally described as 'RNA fingerprinting by arbitrarily primed PCR' (Liang 
and Pardee 1992) this method is now more commonly referred to as 'differential 

Table 2. Genes up-regulated in rat liver following 3-day exposure to phenobarbital. 

Band number 
(approximate                Highest sequence 
size in bp)                 similarity          FASTA-EMBL gene identification 

                Clone 2     75.3%               CYP2B2 
12 (750)                    93.8%               TRPM-2 mRNA 
                                                Sulfated glycoprotein 
15 (600)                    92.9%               Preproalbumin 
                                                Serum albumin mRNA 
16 (55)         Clone 1     95.2%               CYP2B1 
                Clone 2     93.6%               Haptoglobulin mRNA partial alpha 
21 (350)                    99.3%               18S, 5.8S & 28S rRNA 

Bands 1-4, 6, 9, 13, 14, and 17-20 were shown to be false positives by dot blot analysis and, therefore, 
were not sequenced. Derived from Rockett et al. (1997). It should be noted that the above genes do not 
represent the complete spectrum of genes which are up-regulated in rat liver by phenobarbital, but 
simply represent the genes sequenced and identified to date. 



Table 3. Genes down-regulated in rat liver following 3-day exposure to phenobarbital. 

Band number 
(approximate                Highest sequence 
size in bp)                 similarity          FASTA-EMBL gene identification 

1 (1500)                    95.3%               3-oxoacyl-CoA thiolase 
2 (1200)                    92.3%               Hemopoxin mRNA 
3 (1000)                    91.7%               Alpha-2u-globulin mRNA 
7 (700)         Clone 1     77.2%               M. musculus C1 inhibitor 
                Clone 2     94.5%               Electron transfer flavoprotein 
                Clone 3     91.0%               M. musculus Topoisomerase 1 (Topo 1) 
8 (650)         Clone 1     86.9%               Soares 2NbMT M. musculus (EST) 
                Clone 2     96.2%               Alpha-2u-globulin (s-type) mRNA 
9 (600)         Clone 1     86.9%               Soares mouse NML M. musculus (EST) 
                Clone 2     82.0%               Soares p3NMF19.5 M. musculus (EST) 
10 (550)                    73.8%               Soares mouse NML M. musculus (EST) 
11 (525)                    95.7%               NCI-CGAP-Pr1 H. sapiens (EST) 
12 (375)                    100.0%              Ribosomal protein 
13 (23)         Clone 1     97.2%               Soares mouse embryo NbME135 (EST) 
                Clone 2     100.0%              Fibrinogen B-beta-chain 
                Clone 3     100.0%              Apolipoprotein E gene 
14 (170)                    96.0%               Soares p3NMF19.5 M. musculus (EST) 
15 (140)                    97.3%               Stratagene mouse testis (EST) 
Others: (300)               96.7%               R. norvegicus RASP 1 mRNA 
        (275)               93.1%               Soares mouse mammary gland (EST) 

EST = Expressed sequence tag. Bands 4-6 were shown to be false positives by dot blot analysis and, 
therefore, were not sequenced. Derived from Rockett et al. (1997). It should be noted that the above genes 
do not represent the complete spectrum of genes which are down-regulated in rat liver by phenobarbital, 
but simply represent the genes sequenced and identified to date. 



display' (DD). In this method, all the mRNA species in the control and treated cell 
populations are amplified in separate reactions using reverse transcriptase-PCR 
(RT-PCR). The products are then run side-by-side on sequencing gels. Those 
bands which are present in one display only, or which are much more intense in one 
display compared to the other, are differentially expressed and may be recovered for 
further characterization. One advantage of this system is the speed with which it can 
be carried out — 2 days to obtain a display and as little as a week to make and identify 
clones. 

Two commonly used variations are based on different methods of priming the 
reverse transcription step (figure 8). One is to use an oligo dT with a 2-base 'anchor' 
at the 3'-end, e.g. 5'-(dT)nCA-3' (Liang and Pardee 1992). Alternatively, an 
arbitrary primer may be used for 1st strand cDNA synthesis (Welsh et al. 1992). 
This variant of RNA fingerprinting has also been called 'RAP' (RNA Arbitrarily 
Primed)-PCR. One advantage of this second approach is that PCR products may be 
derived from anywhere in the RNA, including open reading frames. In addition, it 
can be used for mRNAs that are not polyadenylated, such as many bacterial mRNAs 
(Wong and McClelland 1994). In both cases, following reverse transcription and 
denaturation, second strand cDNA synthesis is carried out with an arbitrary primer 
(arbitrary primers have a single base at each position, as compared to random 
primers, which contain a mixture of all four bases at each position). The resulting 
PCR thus produces a series of products which, depending on the system (primer 
length and composition, polymerase and gel system), usually includes 50-100 
products per primer set (Band and Sager 1989). When a combination of different 
dT-anchors and arbitrary primers is used, almost all mRNA species from a cell can 
be amplified. When the cDNA products from two different populations are analysed 
side by side on a polyacrylamide gel, differences in expression can be identified and 
the appropriate bands recovered for cloning and further analysis. 
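
To give a feel for the scale of such an experiment, the following sketch enumerates the anchored oligo(dT) primers and multiplies them by a panel of arbitrary primers. It is a rough illustration only: the anchor alphabet (G, C or A at each anchor position) follows the figure 8 caption, whereas the oligo(dT) length of 12 and the panel of 20 arbitrary primers are assumed values that do not come from the text. 

```python
from itertools import product

ANCHOR_BASES = "GCA"   # anchor alphabet taken from the figure 8 caption
DT_LENGTH = 12         # assumed oligo(dT) length; the text gives only (dT)n


def anchored_primers(dt_length=DT_LENGTH, bases=ANCHOR_BASES):
    """Return every 5'-(dT)n-NN-3' primer carrying a two-base 3' anchor."""
    return ["T" * dt_length + "".join(pair) for pair in product(bases, repeat=2)]


primers = anchored_primers()
n_arbitrary = 20                 # hypothetical number of arbitrary primers
bands_low, bands_high = 50, 100  # products per primer set, as quoted in the text

# One primer set = one anchored primer combined with one arbitrary primer.
primer_sets = len(primers) * n_arbitrary
print(f"{len(primers)} anchored primers, including {primers[5]}")
print(f"{primer_sets} primer sets")
print(f"{primer_sets * bands_low}-{primer_sets * bands_high} bands in total")
```

The band total is an upper limit on what is displayed, not a count of distinct mRNAs, since the same message can give rise to bands in more than one primer set. 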

Although DD is perhaps the most popular approach used today for identifying 
differentially expressed genes, it does suffer from several perceived disadvantages: 

(1) It may have a strong bias towards high copy number mRNAs (Bertioli et al. 
1995), although this has been disputed (Wan et al. 1996) and the isolation of very 
low abundance genes may be achieved in certain circumstances (Guimeraes et 
al. 1995a). 

(2) The cDNAs obtained often only represent the extreme 3' end of the mRNA 
(often the 3'-untranslated region), although this may not always be the case 
(Guimeraes et al. 1995a). Since the 3' end is often not included in Genbank and 
shows variation between organisms, cDNAs identified by DD cannot always be 
matched with their genes, even if they have been identified. 

(3) The pattern of differential expression seen on the display often cannot be 
reproduced on Northern blots, with false positives arising in up to 70% of cases 
(Sun et al. 1994). Some adaptations have been shown to reduce false positives, 
including the use of two reverse transcriptases (Sung and Denman 1997), 
comparison of uninduced and induced cells over a time course (Burn et al. 1994) 
and comparison of DDPCR-products from two uninduced and two induced 
lines (Sompayrac et al. 1995). The latter authors also reported that the use of 
cytoplasmic RNA rather than total RNA reduces false positives arising from 
nuclear RNA that is not transported to the cytoplasm. 

Further details of the background, strengths and weaknesses of the DD 
technique can be obtained from a review by McClelland et al. (1996) and from 
articles by Liang et al. (1995) and Wan et al. (1996). 

Figure 8. Two approaches to differential display (DD) analysis. 1st strand synthesis can be carried out 
either with a poly(dT)nNN primer (where N = G, C or A) or with an arbitrary primer. The use of 
different combinations of G, C and A to anchor the first strand polydT primer enables the priming 
of the majority of polyadenylated mRNAs. Arbitrary primers may hybridize at none, one or more 
places along the length of the mRNA, allowing 1st strand cDNA synthesis to occur at none, one 
or more points in the same gene. In both cases, 2nd strand synthesis is carried out with an arbitrary 
primer. Since these arbitrary primers for the 2nd strand may also hybridize to the 1st strand cDNA 
in a number of different places, several different 2nd strand products may be obtained from one 
binding point of the 1st strand primer. Following 2nd strand synthesis, the original set of primers 
is used to amplify the second strand products, with the result that numerous gene sequences are 
amplified. 



Restriction endonuclease-facilitated analysis of gene expression 

Serial Analysis of Gene Expression (SAGE) 

A more recent development in the field of differential display is SAGE analysis 
(Velculescu et al. 1995). This method uses a different approach to those discussed so 
far and is based on two principles. Firstly, in more than 95% of cases, short 
nucleotide sequences ('tags') of only nine or 10 base pairs provide sufficient 
information to identify their gene of origin. Secondly, concatenation (linking 
together in a series) of these tags allows sequencing of multiple cDNAs within a 
single clone. Figure 9 shows a schematic representation of the SAGE process. In this 
procedure, double stranded cDNA from the test cells is synthesized with a 
biotinylated polydT primer. Following digestion with a commonly cutting (4 bp 
recognition sequence) restriction enzyme ('anchoring enzyme'), the 3' ends of the 
cDNA population are captured with streptavidin beads. The captured population is 
split into two and different adaptors ligated to the 5' ends of each group. Incorporated 
into the adaptors is a recognition sequence for a type IIS restriction enzyme, one 
which cuts DNA at a defined distance (< 20 bp) from its recognition sequence. 
Hence, following digestion of each captured cDNA population with the IIS enzyme, 
the adaptors plus a short piece of the captured cDNA are released. The two 
populations are then ligated and the products amplified. The amplified products are 
cleaved with the original anchoring enzyme, religated (concatemers are formed in 
the process) and cloned. The advantage of this system is that hundreds of gene tags 
can be identified by sequencing only a few clones. Furthermore, the number of times 
a given transcript is identified is a quantitative measurement of that gene's 
abundance in the original population, a feature which facilitates identification of 
differentially expressed genes in different cell populations. 
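
The tag-counting principle can be sketched in a few lines. The example below is a simplified in silico illustration, not the published protocol: it assumes an anchoring enzyme with the recognition site CATG, takes the 10 bp immediately 3' of the 3'-most site as the tag, and uses invented toy sequences. A tag of 9 bp can in principle distinguish 4^9 = 262 144 sequences. 

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # assumed anchoring-enzyme site (4 bp recognition sequence)
TAG_LENGTH = 10       # tag length in bp, within the 9-10 bp range in the text


def extract_tag(cdna, site=ANCHOR_SITE, tag_length=TAG_LENGTH):
    """Return the tag next to the 3'-most anchoring site, or None if absent."""
    pos = cdna.rfind(site)
    if pos == -1:
        return None  # transcript lacks the site and yields no tag
    start = pos + len(site)
    tag = cdna[start:start + tag_length]
    return tag if len(tag) == tag_length else None


def tag_counts(cdnas):
    """Count how often each tag occurs in a population of cDNAs."""
    tags = (extract_tag(c) for c in cdnas)
    return Counter(t for t in tags if t is not None)


# Hypothetical toy sequences (sense strand, 5' to 3'), not real transcripts.
gene_a = "GGCATGAAACCCGGGTTTACGTCATGTTGACCAGTCAAAAAA"
gene_b = "TTCATGGATTACAGGCTTAAAAAA"
control = tag_counts([gene_a] * 2 + [gene_b] * 5)
treated = tag_counts([gene_a] * 8 + [gene_b] * 5)

for tag in sorted(set(control) | set(treated)):
    print(tag, control[tag], treated[tag])
```

In this toy population the tag from the first sequence rises from 2 to 8 counts between control and treated samples while the second stays at 5, which is how tag frequency serves as a digital readout of relative expression. 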

Some disadvantages of SAGE analysis are that the method is technically 
difficult, requires a large amount of accurate sequencing, is biased towards abundant 
mRNAs, has not been validated in the pharmaco/toxicogenomic setting and has 
only been used to examine well known tissue differences to date. 

Gene Expression Fingerprinting (GEF) 

A different capture/restriction digest approach for isolating differentially 
expressed genes has been described by Ivanova and Belyavsky (1995). In this 
method, RNA is converted to cDNA using biotinylated oligo(dT) primers. The 
cDNA population is then digested with a specific endonuclease and captured with 
magnetic streptavidin microbeads to facilitate removal of the unwanted 5' digestion 
products. The use of restricted 3'-ends alone serves to reduce the complexity of the 
cDNA fragment pool and helps to ensure that each RNA species is represented by 
not more than one restriction product. An adaptor is ligated to facilitate subsequent 
amplification of the captured population. PCR is carried out with one adaptor- 
specific and one biotinylated polydT primer. The reamplified population is 
recaptured and the non-biotinylated strands removed by alkaline dissociation. The 
non-biotinylated strand is then resynthesized using a different adaptor-specific 
primer in the presence of a radiolabeled dNTP. The labelled immobilized 3' cDNA 
ends are next sequentially treated with a series of different restriction endonucleases 
and the products from each digestion analysed by PAGE. The result is a fingerprint 
composed of a number of ladders (equal to the number of sequential digests used). 
By comparing test versus control fingerprints, it is possible to identify differentially 
expressed products which can then be isolated from the gel and cloned. The 
advantages of this procedure are that it is very robust and reproducible, and the 
authors estimate that 80-93% of cDNA molecules are involved in the final 
fingerprint. The disadvantage is that polyacrylamide gels can rarely resolve more 
than 300-400 bands, which compares poorly to the 1000 or more which are 
estimated to be produced in an average experiment. The use of 2-D gels such as 
those described by Uitterlinden et al. (1989) and Hatada et al. (1991) may help to 
overcome this problem. 

A similar method for displaying restriction endonuclease fragments was later 
described by Prashar and Weissman (1996). However, instead of sequential 
digestion of the immobilized 3'-terminal cDNA fragments, these authors simply 
compared the profiles of the control and treated populations without further 
manipulation. 

Figure 9. Serial analysis of gene expression (SAGE) analysis. cDNA is cleaved with an anchoring enzyme 
(AE) and the 3' ends captured using streptavidin beads. The cDNA pool is divided in half and each 
portion ligated to a different linker, each containing a type IIS restriction site (tagging enzyme, 
TE). Restriction with the type IIS enzyme releases the linker plus a short length of cDNA 
(XXXXX and OOOOO indicate nucleotides of different tags). The two pools of tags are then 
ligated and amplified using linker-specific primers. Following PCR, the products are cleaved with 
the AE and the ditags isolated from the linkers using PAGE. The ditags are then ligated (during 
which process, concatenation occurs) and cloned into a vector of choice for sequencing. After 
Velculescu et al. (1995), with permission. 

DNA arrays 

'Open' differential display systems are cumbersome in that it takes a great deal 
of time to extract and identify candidate genes and then confirm that they are indeed 
up- or down-regulated in the treated compared to the control tissue. Normally the 
latter process is carried out using Northern blotting or RT-PCR. Even so, each of 
the aforementioned steps produces a bottleneck to the ultimate goal of rapid analysis 
of gene expression. These problems will likely be addressed by the development of 
so-called DNA arrays (e.g. Gress et al. 1992, Zhao et al. 1995, Schena et al. 1996), 
the introduction of which has signalled the next era in differential gene expression 
analysis. DNA arrays consist of a gridded membrane or glass 'chip' containing 
hundreds or thousands of DNA spots, each consisting of multiple copies of part of 
a known gene. The genes are often selected based on previously proven involvement 
in oncogenesis, cell cycling, DNA repair, development and other cellular processes. 
[illegible] 
Human and mouse arrays are already commercially available and a few companies 
will construct a personalized array to order, for example Clontech Laboratories and 
Research Genetics Inc. The technique is rapid in that hundreds or even thousands 
of genes can be spotted on a single array, and that mRNA/cDNA from the test 
populations can be labelled and used directly as probe. When analysed with 
appropriate software, arrays offer a rapid and quantitative means to 
assess differences in gene expression between two cell populations. Of course, there 
can only be identification and quantitation of those genes which are in the array 
(hence the term 'closed' system). Therefore, one approach to elucidating the 
molecular mechanisms involved in a particular disease/development system may be 
to combine an open and a closed system: a DNA array to directly identify and 
quantitate the expression of known genes in mRNA populations, and an open 
system such as SSH to isolate unknown genes which are differentially expressed. 

One of the main advantages of DNA arrays is the huge number of gene fragments 
which can be put on a membrane; some companies have reported gridding up to 
60 000 spots on a single glass 'chip' (microscope slide). These high density 
chip-based micro-arrays will probably become available as mass-produced off-the-shelf 
items in the near future. This should facilitate the more rapid determination of 
differential expression in time and dose-response experiments. Aside from their 
high cost and the technical complexities involved in producing and probing DNA 
arrays, the problem which remains with the newer micro-array 
(gene-chip) technologies is that results are often not wholly reproducible between 
arrays. However, this problem is being addressed and should be resolved within the 
next few years. 



EST databases as a means to identify differentially expressed genes 

Expressed sequence tags (ESTs) are partial sequences of clones obtained from 
cDNA libraries. Even though most ESTs have no formal identity (putative 
identification is the best to be hoped for), they have proven to be a rapid and efficient 
means of discovering new genes and can be used to generate profiles of 
gene-expression in specific cells. Since they were first described by Adams et al. (1991), 
there has been a huge explosion in EST production and it is estimated that there are 
now well over a million such sequences in the public domain, representing over half 
of all human genes (Hillier et al. 1996). This large number of freely available 
sequences (both sequence information and clones are normally available royalty-free 
from the originators) has enabled the development of a new approach towards 
differential gene expression analysis as described by Vasmatzis et al. (1998). The 
approach is simple in theory: EST databases are first searched for genes that have a 
number of related EST sequences from the target tissue of choice, but none or few 
from non-target tissue libraries. Programmes to assist in the assembly of such sets of 
overlapping data may be developed in-house or obtained privately or from the 
internet. For example, the Institute for Genomic Research (TIGR, found at 
http://www.tigr.org) provides many software tools free of charge to the scientific 
community. Included amongst these is the TIGR assembler (Sutton et al. 1995), a 
tool for the assembly of large sets of overlapping data such as ESTs, bacterial 
artificial chromosomes (BACs), or small genomes. Candidate EST clones 
representing different genes are then analysed using RNA blot methods for size and tissue 
specificity and, if required, used as probes to isolate and identify the full length 
cDNA clone for further characterization. In practice however, the method is rather 
more involved, requiring bioinformatic and computer analysis coupled with 
confirmatory molecular studies. Vasmatzis et al. (1998) have described several 
problems in this fledgling approach, such as separating highly homologous 
sequences derived from different genes and an overemphasis of specificity for some 
EST sequences. However, since these problems will largely be addressed by the 
development of more suitable computer algorithms and an increased completeness 
of the EST database, it is likely that this approach to identifying differentially 
expressed genes may enjoy more patronage in the future. 
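
The database search at the heart of this approach amounts to counting ESTs per gene cluster in the target and non-target libraries and keeping the clusters that are well represented only in the target. The sketch below illustrates that filter under stated assumptions: the record format, the tissue labels, the cluster names and both thresholds are hypothetical, and it is not the procedure of Vasmatzis et al. (1998). 

```python
from collections import defaultdict


def tissue_enriched_clusters(est_records, target, min_target=3, max_other=1):
    """Return clusters with many ESTs in `target` and few or none elsewhere.

    est_records: iterable of (cluster_id, tissue) pairs, one pair per EST.
    min_target and max_other are illustrative thresholds.
    """
    in_target = defaultdict(int)
    elsewhere = defaultdict(int)
    for cluster_id, tissue in est_records:
        if tissue == target:
            in_target[cluster_id] += 1
        else:
            elsewhere[cluster_id] += 1
    return sorted(
        cluster
        for cluster, n in in_target.items()
        if n >= min_target and elsewhere[cluster] <= max_other
    )


# Hypothetical EST records, invented for illustration only.
records = (
    [("cluster_1", "prostate")] * 6
    + [("cluster_2", "prostate")] * 4 + [("cluster_2", "liver")] * 5
    + [("cluster_3", "prostate")] * 3 + [("cluster_3", "brain")]
    + [("cluster_4", "liver")] * 7
)
candidates = tissue_enriched_clusters(records, target="prostate")
print(candidates)
```

Clusters passing such a filter are only candidates; as the text notes, they still need confirmation by RNA blot or similar methods, and closely related genes that fall into one cluster will confound the counts. 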



Problems and potential of differential expression techniques 

The holistic or single cell approach? 

When working with in vivo models of differential expression, one of the first 
issues to consider must be the presence of multiple cell types in any given specimen. 
For example, a liver sample is likely to contain not only hepatocytes, but also 
(potentially) Ito cells, bile ductule cells, endothelial cells, various immune cells (e.g. 
lymphocytes, macrophages and Kupffer cells) and fibroblasts. Other tissues will 
each have their own distinctive cell populations. Also, in the case of neoplastic tissue, 
there are almost always normal, hyperplastic and/or dysplastic cells present in a 
sample. One must, therefore, be aware that genes obtained from a differential 
display experiment performed on an animal tissue model may not necessarily arise 
exclusively from the intended 'target' cells, e.g. hepatocytes/neoplastic cells. If 
appropriate, further analyses using immunohistochemistry, in situ hybridization or 
in situ RT-PCR should be used to confirm which cell types are expressing the 
gene(s) of interest. This problem is probably most acute for those studying the 
differential expression of genes in the development of different cell types, where 
there is a need to examine homogeneous cell populations. The problem is now being 
addressed at the National Cancer Institute (Bethesda, MD, USA) where new 
microdissection techniques have been employed to assist in their gene analysis programme, 
the Cancer Genome Anatomy Project (CGAP) (for more information see web site: 
http://www.ncbi.nlm.nih.gov/ncicgap/intro.html). There are also separation 
techniques available that utilise cell-specific antigens as a means to isolate target cells, 
e.g. fluorescence activated cell sorting (FACS) (Dunbar et al. 1998, Kas-Deelen et 
al. 1998) and magnetic bead technology (Richard et al. 1998, Rogler et al. 1998). 

However, those taking a holistic approach may consider this issue unimportant. 
There is an equally appropriate view that all those genes showing altered expression 
within a compromised tissue should be taken into consideration. After all, since all 
tissues are complex mixes of different, interacting cell types which intimately 
regulate each other's growth and development, it is clear that each cell type could in 
some way contribute (positively or negatively) towards the molecular mechanisms 
which lie behind responses to external stimuli or neoplastic growth. It is perhaps 
then more informative to carry out differential display experiments using in vivo as 
opposed to in vitro models, where uniform populations of identical cells probably 
represent a partial, skewed or even inaccurate picture of the molecular changes that 
occur. 

The incidence and possible implications of inter-individual biological variation 
should be considered in any approach where whole animal models are being used. It 
is clear that individuals (humans and animals) respond in different ways to identical 
stimuli. One of the best characterized examples is the debrisoquine oxidation 
polymorphism, which is mediated by cytochrome CYP2D6 and determines the 
pharmacokinetics of many commonly prescribed drugs (Lennard 1993, Meyer and 
Zanger 1997). The reasons for such differences are varied and complex, but allelic 
variations, regulatory region polymorphisms and even physical and mental health 
can all contribute to observed differences in individual responses. Careful thought 
should, therefore, be given to the specific objectives of the study and to the possible 
value of pooling starting material (tissue/mRNA). The effect of this can be 
beneficial through the ironing out of exaggerated responses and unimportant minor 
fluctuations of (mechanistically) irrelevant genes in individual animals, thus 
providing a clearer overall picture of the general molecular mechanisms of the 
response. However, at the same time such minor variations may be of utmost 
importance in deciding the ability of individual animals to succumb to or resist the 
effects of a given chemical/disease. 



How efficient are differential expression techniques at recovering a high percentage of 
differentially expressed genes? 

A number of groups have produced experimental data suggesting that 
mammalian cells produce between 8000 and 15 000 different mRNA species at any one time 
(Mechler and Rabbitts 1981, Hedrick et al. 1984, Bravo 1990), although figures as 
high as 20 000-30 000 have also been quoted (Axel et al. 1976). Hedrick et al. (1984) 
provided evidence suggesting that the majority of these belong to the rare abundance 
class. A breakdown of this abundance distribution is shown in table 1. 

When the results of differential display experiments have been compared with 
data obtained previously using other methods, it is apparent that not all differentially 
expressed mRNAs are represented in the final display. In particular, rare messages 
(which, importantly, often include regulatory proteins) are not easily recovered 
using differential display systems. This is a major shortcoming, as the majority of
mRNA species exist at levels of less than 0.005% of the total population (table 1).
Bertioli et al. (1995) examined the efficiency of DD templates (heterogeneous
mRNA populations) for recovering rare messages and were unable to detect mRNA
species present at less than 1.2% of the total mRNA population, a level equivalent to an
intermediate or abundant species. Interestingly, when simple model systems (single
target only) were used instead of a heterogeneous mRNA population, the same
primers could detect target mRNA at levels up to 10 000-fold lower. These results
are probably best explained by competition for substrates from the many PCR
products produced in a DD reaction.
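
The scale of this shortfall can be checked with simple arithmetic. The following Python sketch uses only the figures quoted above (a 1.2% detection limit for heterogeneous templates, 0.005% as the level below which most mRNA species fall, and a 10 000-fold gain in single-target systems); the variable names are illustrative and not taken from any of the cited studies.

```python
# Figures quoted in the text, expressed as percent of the total mRNA population.
dd_detection_limit_pct = 1.2     # Bertioli et al. (1995), heterogeneous template
rare_message_level_pct = 0.005   # most mRNA species fall below this level (table 1)
single_target_gain = 10_000      # fold improvement reported for single-target models

# How far below the DD detection limit a message at the 0.005% level sits.
shortfall_fold = dd_detection_limit_pct / rare_message_level_pct

# Detection limit implied for the simple, single-target system.
single_target_limit_pct = dd_detection_limit_pct / single_target_gain

print(f"Rare messages lie about {shortfall_fold:.0f}-fold below the DD limit")
print(f"Implied single-target limit: {single_target_limit_pct:.5f}% of total mRNA")
```

On these figures, a message at the 0.005% level would need to be some 240 times more abundant before it could be detected in a heterogeneous DD reaction.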

The numbers of differentially expressed mRNAs reported in the literature using
various model systems provide further evidence that many differentially expressed
mRNAs are not recovered. For example, DeRisi et al. (1997) used DNA array
technology to examine gene expression in yeast following exhaustion of sugar in the
medium, and found that more than 1700 genes showed a change in expression of at
least 2-fold. In light of such a finding, it would not be unreasonable to suggest that
of the 8000-15 000 different mRNA species produced by any given mammalian cell,
up to 1000 or more may show altered expression following chemical stimulation.
Whilst this may be an extreme figure, it is known that at least 100 genes are
activated/upregulated in Jurkat (T-) cells following IL-2 stimulation (Ullman et al.
1990). In addition, Wan et al. (1996) estimated that interferon-γ-stimulated HeLa
cells differentially express up to 433 genes (assuming 24 000 distinct mRNAs
expressed by the cells). However, there have been few publications documenting
anywhere near the recovery of these numbers. For example, in using DD to compare
normal and regenerating mouse liver, Bauer et al. (1993) found only 70 of 38 000
total bands to be different. Of these, 50% (35 genes) were shown to correspond to
differentially expressed bands. Chen et al. (1996) reported 10 genes upregulated in
female rat liver following ethinyl estradiol treatment. McKenzie and Drake (1997)
identified 14 different gene products whose expression was altered by phorbol
myristate acetate (PMA, a tumour promoter agent) stimulation of a human
myelomonocytic cell line. Kilty and Vickers (1997) identified 10 different gene
products whose expression was upregulated in the peripheral blood leukocytes of
allergic disease sufferers. Linskens et al. (1995) found 23 genes differentially
expressed between young and senescent fibroblasts. Techniques other than DD
have also provided an apparent paucity of differentially expressed genes. Using SH,
for example, Cao et al. (1997) found 15 genes differentially expressed in colorectal
cancer compared to normal mucosal epithelium. Fitzpatrick et al. (1995) isolated 17
genes upregulated in rat liver following treatment with the peroxisome proliferator,
clofibrate; Philips et al. (1990) isolated 12 cDNA clones which were upregulated in
highly metastatic mammary adenocarcinoma cell lines compared to poorly
metastatic ones. Prashar and Weissman (1996) used 3′ restriction fragment analysis and
identified approximately 40 genes showing altered expression within 4 h of
activation of Jurkat T-cells. Groenink and Leegwater (1996) analysed 27 gene
fragments isolated using SSH from the delayed early response phase of liver
regeneration and found only 12 to be upregulated.
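
To put two of these figures side by side, the short Python sketch below works through the arithmetic for the Bauer et al. (1993) display and the Wan et al. (1996) estimate. It uses only numbers quoted above. The two studies concern different model systems, so the comparison is indicative only, and the variable names are illustrative.

```python
# Figures quoted in the text above.
bauer_total_bands = 38_000        # total DD bands, Bauer et al. (1993)
bauer_different_bands = 70        # bands that appeared to differ
bauer_confirmed_fraction = 0.50   # fraction reported as confirmed

wan_expected_genes = 433          # Wan et al. (1996) estimate
wan_total_mrnas = 24_000          # distinct mRNAs assumed in that estimate

confirmed_genes = bauer_different_bands * bauer_confirmed_fraction
pct_bands_different = 100 * bauer_different_bands / bauer_total_bands
pct_genes_expected = 100 * wan_expected_genes / wan_total_mrnas

print(f"Confirmed in the DD study: {confirmed_genes:.0f} genes")
print(f"Bands found to differ: {pct_bands_different:.2f}% of all bands")
print(f"Estimated differentially expressed: {pct_genes_expected:.1f}% of mRNAs")
```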

In this laboratory, SSH was used to isolate up to 70 candidate genes which appear
to show altered expression in guinea pig liver following short-term treatment with 
the peroxisome proliferator, WY-14,643 (Rockett, Swales, Esdaile and Gibson, 
unpublished observations). However, these findings have still to be confirmed by 
analysis of the extracted tissue mRNA for differential expression of these sequences. 

Whilst the latest differential display technologies are purported to include design
and experimental modifications to overcome this lack of efficiency (in both the total
number of differentially expressed genes recovered and the percentage that are true
positives), it is still not clear if such adaptations are practically effective: proving
efficiency by spiking with a known amount of limited numbers of artificial
construct(s) is one thing, but isolating a high percentage of the rare messages already
present in an mRNA population is another. Of course, some models will genuinely
produce only a small number of differentially expressed genes. In addition, there are
also technical problems that can reduce efficiency. For example, mRNAs may have
an unusual primary structure that effectively prevents their amplification by
PCR-based systems. In addition, it is known that under certain circumstances not all
mRNAs have 3′ polyA tails. For example, during Xenopus development,
deadenylation is used as a means to stabilize RNAs (Voeltz and Steitz 1998), whilst
preferential deadenylation may play a role in regulating Hsp70 (and perhaps,
therefore, other stress protein) expression in Drosophila (Dellavalle et al. 1994). The
presence of deadenylated mRNAs would clearly reduce the efficiency of systems
utilizing a polydT reverse transcription step. The efficiency of any system also
depends on the quality of the starting material. All differential display techniques
use mRNA as their target material. However, it is difficult to isolate mRNA that is
completely free of ribosomal RNA. Even if polydT primers are used to prime first
strand cDNA synthesis, ribosomal RNA is often transcribed to some degree
(Clontech PCR-Select cDNA Subtraction kit user manual). It has been shown, at
least in the case of SSH, that a high rRNA:mRNA ratio can lead to inefficient
subtractive hybridization (Clontech PCR-Select cDNA Subtraction kit user
manual), and there is no reason to suppose that it will not do likewise in other SH
approaches. Finally, those techniques that utilise a presubtraction amplification step
(e.g. RDA) may present a skewed representation since some sequences amplify
better than others.

Of course, probably the most important consideration is the temporal factor. It
is clear that any given differential display experiment can only interrogate a cell at
one point in time. It may well be that a high percentage of the genes showing altered
expression at that time are obtained. However, given that disease processes and
responses to environmental stimuli involve dynamic cascades of signalling,
regulation, production and action, it is clear that all those genes which are switched
on/off at different times will not be recovered and, therefore, vital information may
well be missed. It is, therefore, imperative to obtain as much information about the
model system beforehand as possible, from which a strategy can be derived for
targeting specific time points or events that are of particular interest to the
investigator. One way of getting round this problem of single time point analysis is
to conduct the experiment over a suitable time course which, of course, adds
substantially to the amount of work involved.



How sensitive are differential expression technologies? 

There have been few published data addressing the issue of how large the
change in expression must be to permit isolation of the gene in question with
the various differential expression technologies. Although the isolation of genes
whose expression is changed as little as 1.5-fold has been reported using SSH
(Groenink and Leegwater 1996), it appears that those demonstrating a change in
excess of 5-fold are more likely to be picked up. Thus, there is a 'grey zone'
in between where small changes could fade in and out of isolation between
experiments and animals. DD, on the other hand, is not subject to this grey
zone since, unlike SH approaches, it does not amplify the difference in expression
between two samples. Wan et al. (1996) reported that differences in expression of
twofold or more are detectable using DD.
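
The 'grey zone' described above can be expressed as a simple rule of thumb. The Python sketch below is illustrative only: the 1.5-fold and 5-fold boundaries come from the text, whereas the function name, the category labels and the symmetrical treatment of down-regulation are assumptions made for the example.

```python
def ssh_isolation_zone(fold_change: float) -> str:
    """Place a fold change into the SSH isolation zones discussed in the text.

    Assumption: down-regulation is treated symmetrically, so 0.2 (5-fold down)
    is handled like 5.0 (5-fold up).
    """
    if fold_change <= 0:
        raise ValueError("fold change must be positive")
    magnitude = fold_change if fold_change >= 1 else 1 / fold_change
    if magnitude > 5:
        return "likely to be isolated"
    if magnitude >= 1.5:
        return "grey zone"
    return "unlikely to be isolated"


for change in (1.2, 2.0, 6.0, 0.1):
    print(change, "->", ssh_isolation_zone(change))
```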

Resolution and visualization of differential expression products 

It seems highly improbable with current technology that a gel system could be
developed that is able to resolve all gene species showing altered expression in any
given test system (be it SH- or DD-based). Polyacrylamide gel electrophoresis
(PAGE) can resolve size differences down to 0.2% (Sambrook et al. 1989) and is
used as standard in DD experiments. Even so, it is clear that a complex series of gene
products such as those seen in a DD will contain unresolvable components. Thus,
what appears to be one band in a gel may in fact turn out to be several. Indeed, it has
been well documented (Mathieu-Daude et al. 1996, Smith et al. 1997) that a single
band extracted from a DD often represents a composite of heterogeneous products,
and the same has been found for SSH displays in this laboratory (Rockett et al.
1997). One possible solution was offered by Mathieu-Daude et al. (1996), who
extracted and reamplified candidate bands from a DD display and used single strand
conformation polymorphism (SSCP) analysis to confirm which components
represented the truly differentially expressed product.

Many scientists try to avoid the use of PAGE where possible because it is
technically more demanding than agarose gel electrophoresis (AGE). Unfortunately,
high resolution agarose gels such as Metaphor (FMC, Lichfield, UK) and AquaPor
HR (National Diagnostics, Hessle, UK), whilst easier to prepare and manipulate
than PAGE, can only separate DNA sequences which differ in size by around
1.5-2% (15-20 base pairs for a 1 kb fragment). Thus, SSH, RDA or other such
products which differ in size by less than this amount are normally not resolvable.
However, a simple technique does in fact exist for increasing the resolving power of
AGE: the inclusion of HA-red (10-phenyl neutral red-PEG ligand) or HA-yellow
(bisbenzamide-PEG ligand) (Hanse Analytik GmbH, Bremen, Germany) in a
gel separates identical or closely sized products on base content. Specifically,
HA-red and -yellow selectively bind to GC and AT DNA motifs, respectively
(Wawer et al. 1995, Hanse Analytik 1997, personal communication). Since both
HA-stains possess an overall positive charge, they migrate towards the cathode
when an electric field is applied. This is in direct opposition to DNA, which
is negatively charged and, therefore, migrates towards the anode. Thus, if two
DNA clones are identical in size (as perceived on a standard high resolution
agarose gel), but differ in AT/GC content, inclusion of a HA-dye in the gel
will effectively retard the migration of one of the sequences compared to the
other, effectively making it apparently larger and, thus, providing a means of
differentiating between the two. The use of HA-red has been shown to resolve
sequences with an AT variation of less than 1% (Wawer et al. 1995), whilst Hanse
Analytik have reported that HA staining is so sensitive that in one case it was used
to distinguish two 567 bp sequences which differed by only a single point mutation
(Hanse Analytik 1996, personal communication). Therefore, if one wishes to check
whether all the clones produced from a specific band in a differential display
experiment are derived from the same gene species, a small amount of reamplified
or digested clone can be run on a standard high resolution gel, and a second aliquot
in a similar gel containing one of the HA-stains. The standard gel should indicate
any gross size differences, whilst the HA-stained gel should separate otherwise
unresolvable species (on standard AGE) according to their base content. Geisinger
et al. (1997) reported successful use of this approach for identifying DD-derived
clones. Figure 10 shows such an experiment carried out in this laboratory on clones
obtained from a band extracted from an SSH display.

Figure 10. Discrimination of clones of identical/nearly identical size using HA-red. Bands of decreasing
size (1-5) were extracted from the final display of a suppression subtractive hybridization
experiment and cloned. Seven colonies were picked at random from each cloned band and their
inserts amplified using PCR. The products were run on two gels, (A) a high resolution 2% agarose
gel, and (B) a high resolution 2% agarose gel containing 1 U/ml HA-red. With few exceptions, all
the clones from each band appear to be the same size (gel A). However, the presence of HA-red
(gel B), which separates identically-sized DNA fragments based on the percentage of GC within
the sequence, clearly indicates the presence of different gene species within each band. For
example, even though all five re-amplified clones of band 1 appear to be the same size, at least four
different gene species are represented.

An alternative approach is to carry out a 2-D analysis of the differential display 
products. In this approach, size-based separation is first carried out in a standard 
agarose gel. The gel slice containing the display is then extracted and incorporated 
into a HA gel for resolution based on AT/GC content.
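
The logic of such a 2-D analysis can be sketched computationally. The Python example below is a toy model, not a description of gel behaviour: it groups clones that differ in length by no more than an assumed 2% (the approximate limit quoted earlier for high resolution agarose) and then splits each size band wherever GC content differs by more than an assumed 1%. The function names, the thresholds as parameters and the example sequences are all hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def split_by(items, key, resolves):
    """Sort items by key; start a new group whenever neighbours are resolvable."""
    groups = []
    for item in sorted(items, key=key):
        if groups and not resolves(key(groups[-1][-1]), key(item)):
            groups[-1].append(item)
        else:
            groups.append([item])
    return groups


def two_d_separation(clones, size_res=0.02, gc_res=0.01):
    """First dimension: size. Second dimension: GC content within each band."""
    size_bands = split_by(
        clones.items(),
        key=lambda kv: len(kv[1]),
        resolves=lambda a, b: (b - a) / a > size_res,
    )
    result = []
    for band in size_bands:
        spots = split_by(
            band,
            key=lambda kv: gc_fraction(kv[1]),
            resolves=lambda a, b: (b - a) > gc_res,
        )
        result.append([[name for name, _ in spot] for spot in spots])
    return result


# Hypothetical clones: a, b and c co-migrate by size; b differs in base content.
example_clones = {
    "a": "AT" * 50,
    "b": "GC" * 50,
    "c": "AT" * 50 + "A",
    "d": "ATGC" * 50,
}
separated = two_d_separation(example_clones)
print(separated)
```

In this toy example, clones a, b and c form a single band in the size dimension, and only the second, base-content dimension separates b from the other two.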

Of course, one should always consider the possibility of there being different
gene species which are the same size and have the same GC/AT content. However,
even these species are not unresolvable given some effort: again, one might use
SSCP, or perhaps a denaturing gradient gel electrophoresis (DGGE) or temperature
gradient gel electrophoresis (TGGE) approach to resolve the contents of a band,
either directly on the extracted band (Suzuki et al. 1991) or on the reamplified
product.

The requirement of some differential display techniques to visualize large 
numbers of products (e.g. DD and GEF) can also present a problem in that, in terms 
of numbers, the resolution of PAGE rarely exceeds 300-400 bands. One approach to 
overcoming this might be to use 2-D gels such as those described by Uitterlinden et 
al. (1989) and Hatada et al. (1991).

Extraction of differentially expressed bands from a gel can be complex since, in
some cases (e.g. DD, GEF), the results are visualized by autoradiographic means,
such that precise overlay of the developed film on the gel must occur if the correct
band is to be extracted for further analysis. Clearly, a misjudged extraction can
account for many man-hours lost. This problem, and that of the use of radioisotopes,
has been addressed by several groups. For example, Lohmann et al. (1995)
demonstrated that silver staining can be used directly to visualize DD bands in 
horizontal PAGs. An et al. (1996) avoided the use of radioisotopes by transferring a 
small amount (20-30%) of the DNA from their DD to a nylon membrane, and 
visualizing the bands using chemiluminescent staining before going back to extract 
the remaining DNA from the gel. Chen and Peck (1996) went one step further and 
transferred the entire DD to a nylon membrane. The DNA bands were then 
visualized using a digoxigenin (DIG) system (DIG was attached to the polydT 
primers used in the differential display procedure). Differentially expressed bands 
were cut from the membrane and the DNA eluted by washing with PCR buffer prior 
to reamplification. 

One of the advantages of using techniques such as SSH and RDA is that the final
display can be run on an agarose gel and the bands visualized with simple ethidium 
bromide staining. Whilst this approach can provide acceptable results, overstaining 
with SYBR Green I or SYBR Gold nucleic acid stains (FMC) effectively enhances 
the intensity and sharpness of the bands. This greatly aids in their precise extraction 
and often reveals some faint products that may otherwise be overlooked. Whilst
differential displays stained with SYBR Green I are better visualized using short
wavelength UV (254 nm) rather than medium wavelength (306 nm), the shorter
wavelength is much more damaging to DNA. In practice, it takes only a few seconds
to damage DNA extracted under 254 nm irradiation, effectively preventing
reamplification and cloning. The best approach is to overstain with SYBR Green I
and extract bands under medium wavelength UV transillumination.

The possible use of 'microfingerprinting' to reduce complexity

Given the sheer number of gene products and the possible complexity of each
band, an alternative approach to rapid characterization may be to use an enhanced
analysis of a small section of a differential display, namely a 'sub-fingerprint' or
'micro-fingerprint'. In this case, one could concentrate on those bands which only appear
in a particular chosen size region. Reducing the fingerprint in this way has at least
two advantages. One is that it should be possible to use different gel types,
concentrations and run times tailored exactly to that region. Currently, one might
run products from 100-3000+ bp on the same gel, which leads to compromise in the
gel system being used and consequently to suboptimal resolution, both in terms of
size and numbers, and can lead to problems in the accurate excision of individual
bands. Secondly, it may be possible to enhance resolution by using a 2-D analysis
using a HA-stain, as described earlier. In summary, if a range of gene product sizes
is carefully chosen to include certain 'relevant' genes, the 2-D system standardized,
and appropriate gene analysis used, it may be possible to develop a method for the
early and rapid identification of compounds which have similar or widely different
cellular effects. If the prognosis for exposure to one or more other chemicals which
display a similar profile is already known, then one could perhaps predict similar
effects for any new compounds which show a similar micro-fingerprint.

An alternative approach to microfingerprinting is to examine altered expression
in specific families of genes through careful selection of PCR primers and/or post- 
reaction analysis. Stress genes, growth factors and/or their receptors, cell cycling 
genes, cytochromes P450 and regulatory proteins might be considered as candidates 
for analysis in this way. Indeed, some off-the-shelf DNA arrays (e.g. Clontech's 
Atlas cDNA Expression Array series) already anticipated this to some degree by 
grouping together genes involved in different responses e.g. apoptosis, stress, DNA- 
damage response etc. 



Screening 
False positives 

The generation of false positives has been discussed at length amongst the 
differential display community (Liang et al. 1993, 1995, Nishio et al. 1994, Sun et al. 
1994, Sompayrac et al. 1995). The reason for false positives varies with the 
technique being used. For instance, in RDA, the use of adaptors which have not 
been HPLC purified can lead to the production of false positives through illegitimate 
ligation events (O'Neill and Sinclair 1997), whilst in DD they can arise through 
PCR artifacts and illegitimate transcription of rRNA. In SH, false positives appear
to be derived largely from abundant gene species, although some may arise from 
cDNA/mRNA species which do not undergo hybridization for technical reasons. 

A quick screening of putative differentially expressed clones can be carried out 
using a simple dot blot approach, in which labelled first strand probes synthesized 
from tester and driver mRNA are hybridized to an array of said clones (Hedrick et 
al. 1984, Sakaguchi et al. 1986). Differentially expressed clones will hybridize to 
tester probe, but not driver. The disadvantage of this approach is that rare species 
may not generate detectable hybridization signals. One option for those using SSH 
is to screen the clones using a labelled probe generated from the subtracted cDNA 
from which it was derived, and with a probe made from the reverse subtraction 
reaction (ClonTechniques 1997a). Since the SSH method enriches rare sequences, 
it should be possible to confirm the presence of clones representing low abundance 
genes. Despite this quick screening step, there is still the need to go back to the 
original mRNA and confirm the altered expression using a more quantitative 
approach. Although this may be achieved using Northern blots, the sensitivity is 
poor by today's high standards and one must rely on PCR methods for accurate and 
sensitive determinations (see below). 
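
The decision made at the dot blot stage can be sketched as follows. This Python example is a hypothetical illustration: the signal values, the background level and the two-fold cut-off are assumptions chosen for the sketch and are not drawn from any of the cited protocols. It simply encodes the reasoning above, including the caveat that rare species may give no detectable signal with either probe.

```python
def classify_dot_blot(tester_signal, driver_signal, background=1.0, min_ratio=2.0):
    """Classify one arrayed clone from its tester and driver probe signals.

    All thresholds are hypothetical: `background` is the assumed detection
    floor and `min_ratio` the assumed fold difference taken as meaningful.
    """
    if tester_signal <= background and driver_signal <= background:
        # Rare species may not generate a detectable hybridization signal.
        return "undetected"
    ratio = max(tester_signal, background) / max(driver_signal, background)
    if ratio >= min_ratio:
        return "candidate: enriched in tester"
    if ratio <= 1 / min_ratio:
        return "candidate: enriched in driver"
    return "probable false positive"


# Hypothetical (tester, driver) signal pairs for four clones.
signals = {"clone1": (10.0, 1.0), "clone2": (5.0, 5.0),
           "clone3": (0.5, 0.8), "clone4": (1.0, 10.0)}
calls = {name: classify_dot_blot(t, d) for name, (t, d) in signals.items()}
print(calls)
```

Clones called 'undetected' are not thereby excluded; as noted above, they would still need to be checked against the original mRNA by a more sensitive method.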



Sequence analysis 

The majority of differential display procedures produce final products which are
between 100 and 1000 bp in size. Such short products considerably limit the length
of sequence available for searching the DNA databases. This in turn leads to a reduced
confidence in the result: several families of genes have members whose DNA
sequences are almost identical except in a few key stretches, e.g. the cytochrome
P450 gene superfamily (Nelson et al. 1996). Thus, does the clone identified as being
almost identical to gene X0 really come from that gene, or its brother gene X1, or its
as yet undiscovered sister X2? For example, using SSH, part of a gene was isolated,
which was up-regulated in the liver of rats exposed to Wy-14,643 and was identified
by a FASTA search as being transferrin (data not shown). However, transferrin is
known to be downregulated by hypolipidemic peroxisome proliferators such as
Wy-14,643 (Hertz et al. 1996), and this was confirmed with subsequent RT-PCR
analysis. This suggests that the gene sequence isolated may belong to a gene which
is closely related to transferrin, but is regulated by a different mechanism.

A further problem associated with SH technology is redundancy. In most cases 
before SH is carried out, the cDNA population must first be simplified by restriction 
digestion. This is important for at least two reasons: 

(1) To reduce complexity: long cDNA fragments may form complex networks
which prevent the formation of appropriate hybrids, especially at the high 
concentrations required for efficient hybridization. 

(2) Cutting the cDNAs into small fragments provides better representation of 
individual genes. This is because genes derived from related but distinct 
members of gene families often have similar coding sequences that may cross- 
hybridize and be eliminated during the subtraction procedure (Ko 1990). 
Furthermore, different fragments from the same cDNA may differ considerably 
in terms of hybridization and amplification and, thus, may not efficiently do one 
or the other (Wang and Brown 1991). Thus, some fragments from differentially 
expressed cDNAs may be eliminated during subtractive hybridization pro- 
cedures. However, other fragments may be enriched and isolated. As a 
consequence of this, some genes will be cut one or more times, giving rise to two 
or more fragments of different sizes. If those same genes are differentially 
expressed, then two or more of the different size fragments may come through 
as separate bands on the final differential display, increasing the observed 
redundancy and increasing the number of redundant sequencing reactions. 
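
The origin of this redundancy can be illustrated with a toy digest. The Python sketch below cuts a made-up cDNA at every occurrence of a four-base recognition site; the site GTAC with a blunt cut after the second base (as for RsaI) is used purely as an assumed example, and the sequence is hypothetical. A single differentially expressed cDNA with two internal sites yields three fragments, each of which could appear as a separate band.

```python
def digest(seq: str, site: str = "GTAC", cut_offset: int = 2) -> list:
    """Cut seq at every occurrence of site, cut_offset bases into the site."""
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        fragments.append(seq[start:pos + cut_offset])
        start = pos + cut_offset
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])  # remainder after the last cut
    return fragments


# Hypothetical cDNA containing two recognition sites.
cdna = "AAAA" + "GTAC" + "CCCCCC" + "GTAC" + "TT"
fragments = digest(cdna)
print(fragments)
print([len(f) for f in fragments])
```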

Sequence comparisons also throw up another important point: at what degree
of sequence similarity does one accept a result? Is 90% identity between a gene
derived from your model species and another acceptably close? Is 95% between
your sequence and one from the same species also acceptable? This problem is
particularly relevant when the forward and reverse sequence comparisons give
similar sequences with completely different gene species! An arbitrary decision
seems to be to allocate genes that are definite (95% and above similarity) and then
group those between 60 and 95% as being related or possible homologues.
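
The arbitrary scheme just described is easily written down as a rule. In the Python sketch below, the 95% and 60% boundaries are those given in the text, whilst the function name and the label for matches below 60% are assumptions added for completeness.

```python
def classify_match(percent_identity: float) -> str:
    """Apply the arbitrary similarity bands described in the text."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must lie between 0 and 100")
    if percent_identity >= 95:
        return "definite"
    if percent_identity >= 60:
        return "related or possible homologue"
    return "unassigned"  # label assumed here; the text does not name this group


for identity in (98.0, 90.0, 45.0):
    print(identity, "->", classify_match(identity))
```

Under this scheme a 90% match would be grouped as related rather than definite, which is a reminder that the boundaries are a convention rather than a biological threshold.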

Quantitative analysis 

At some point, one must give consideration to the quantitative analysis of the 
candidate genes, either as a means of confirming that they are truly differentially 
expressed, or in order to establish just what the differences are. Northern blot 
analysis is a popular approach as it is relatively easy and quick to perform. However, 
the major drawback with Northern blots is that they are often not sensitive enough 
to detect rare sequences. Since the majority of messages expressed in a cell are of low 
abundance (see table 1), this is a major problem. Consequently, RT-PCR may be the 
method of choice for confirming differential expression. Although the procedure is 
somewhat more complex than Northern analysis, requiring synthesis of primers and 
optimization of reaction conditions for each gene species, it is now possible to set up 
high throughput PCR systems using multichannel pipettes, 96+-well plates and

carcinogenic effect. Whilst differential display technology cannot hope to answer
these questions, it does provide a springboard from which identification, regulatory 
and functional studies can be launched. Understanding the molecular mechanism of 
cellular responses is almost impossible without knowing the regulation and function 
of those genes and their condition (e.g. mutated). In an abstract sense, differential 
display can be likened to a still photograph, showing details of a fixed moment in 
time. Consider the Historian who knows the outcome of a battle and the placement 
and condition of the troops before the battle commenced, but is asked to try and 
deduce how the battle progressed and why it ended as it did from a few still 
photographs: an impossible task. In order to understand the battle, the Historian
must find out the capabilities and motivation of the soldiers and their commanding 
officers, what the orders were and whether they were obeyed. He must examine the 
terrain, the remains of the battle and consider the effects the prevailing weather 
conditions exerted. Likewise, if mechanistic answers are to be forthcoming, the 
scientist must use differential display in combination with other techniques such as 
knockout technology, the analysis of cell signalling pathways, mutation analysis and 
time and dose response analyses. Although this review has emphasized the
importance of differential gene profiling, it should not be considered in isolation and
the full impact of this approach will be strengthened if used in combination with
functional genomics and proteomics (2-dimensional protein gels from isoelectric
focusing and subsequent SDS electrophoresis and virtual 2D-maps using capillary
electrophoresis). Proteomics is attracting much recent attention as many of the
changes resulting in differential gene expression do not involve changes in mRNA
levels, as described extensively herein, but rather protein-protein, protein-DNA and
protein phosphorylation events which would require functional genomics or
proteomic technologies for investigation.

Despite the limitations of differential display technology, it is clear that many 
potential applications and benefits can be obtained from characterizing the genetic 
changes that occur in a cell during normal and disease development and in response 
to chemical or biological insult. In light of functional data, such profiling will 
provide a 'fingerprint' of each stage of development or response, and in the long 
term should help in the elucidation of specific and sensitive biomarkers for different 
types of chemical/biological exposure and disease states. The potential medical and 
therapeutic benefits of understanding such molecular changes are almost
immeasurable. Amongst other things, such fingerprints could indicate the family or
even specific type of chemical an individual has been exposed to plus the length 
and/or acuteness of that exposure, thus indicating the most prudent treatment. 
They may also help uncover differences in histologically identical cancers, provide 
diagnostic tests for the earliest stages of neoplasia and, again, perhaps indicate the 
most efficacious treatment. 

The Human Genome Project will be completed early in the next century and the 
DNA sequence of all the human genes will be known. The continuing development 
and evolution of differential gene expression technology will ensure that this 
knowledge contributes fully to the understanding of human disease processes.

SUMMARY



The technique of differential display reverse transcription-polymerase chain reaction (ddRT-PCR) has been used to produce unique 
profiles of up-regulated and down-regulated gene expression in the liver of male Wistar rats following short term exposure to the 
non-genotoxic hepatocarcinogens, phenobarbital and WY-14,643. Animals were treated for 3 days, whereupon their livers were
extracted and snap frozen. mRNA was prepared from the livers and used for ddRT-PCR. Individual bands from the differential 
displays were extracted and cloned. False positives were eliminated by dotblot screening and true positives then sequenced and 
identified. 



INTRODUCTION 

Safety evaluation of new chemicals usually necessitates
the examination of genotoxic and carcinogenic
potential using short-term in vitro and in vivo geno- 
toxicity assays augmented by chronic bioassay tests. 
The short-term assays have proved useful in the early 
identification of potential genotoxic carcinogens, but 
their value is limited by observations which suggest 
that approximately 60% of chemicals identified as car- 
cinogens in life-exposure studies produce mainly 
negative findings in short-term genotoxicity tests (1,2). 
Thus, there is currently no reliable and rapid means of 
evaluating the carcinogenic risk of new chemicals 
which fall into this latter group of compounds, termed 
non-genotoxic (or epigenetic) carcinogens. 



Please send reprint requests to: Dr John Rockett, Molecular
It is now evident that non-genotoxic carcinogens
constitute a group of chemicals which are not only di- 
vergent in their interspecies toxicity, but also demon- 
strate different target organ selectivities and mecha- 
nisms of action (3,4). Elucidation of the molecular 
mechanisms underlying non-genotoxic carcinogenesis 
is currently underway, but the picture is still far from 
complete. It is anticipated that a better understanding 
of the early changes in genetic expression following 
exposure to non-genotoxic carcinogens will aid devel- 
opment of experimental strategies to identify cellular 
markers which are diagnostic for this type of toxicity. 

Subtractive ddRT-PCR is a recently developed 
technique which facilitates the preferential amplifica- 
tion of gene products that demonstrate altered expres- 
sion in target tissue(s) following exposure to chemical 
stimuli. Furthermore, using this technique, no prior 
knowledge of the specific genes which are up/down 
regulated is required. In the current study, we have un- 
dertaken to develop a specific and rapid assay for non- 
genotoxic carcinogens using the technique of ddRT- 
PCR. This has allowed us to identify characteristic 
patterns of gene regulation following administration of 
two different non-genotoxic carcinogens (phenobarbi- 
tal and Wy-14,643) and the subsequent identification 
of individual gene species which are regulated by this 
xenobiotic treatment. 



MATERIALS AND METHODS 
Animals and treatment 

Phenobarbital (BDH, Poole, UK; 100 mg/kg/day) or 
[4-chloro-6-(2,3-xylidino)-2-pyrimidinylthio] acetic 
acid (Wy-14,643) (Campo, Emmerich; 250 
mg/kg/day) was administered by gavage to groups of 
3 male Wistar rats (150-200 g) on three consecutive 
days, whilst control animals received nothing. All ani- 
mals had free access to food (rat and mouse standard 
diet, B&K Universal, Hull, UK) and water. The ani- 
mals were killed on the fourth day, whereupon their 
livers were excised, sliced into 0.5 cm cubes, snap frozen 
in liquid nitrogen and then stored at -70°C. 

mRNA extraction 

Up to 0.25 g of each frozen liver sample was ground 
under liquid nitrogen using a mortar and pestle. 
mRNA was extracted from the ground liver using 
Promega's PolyATtract® System 1000 (Promega, 
Madison, WI, USA) according to the technical man- 
ual. The mRNA was DNase-treated (Promega, final 
concentration 10 U/ml) before phenol/chloroform extraction 
and ethanol precipitation. The mRNA was resuspended 
at a final concentration of 500-1000 ng/µl. 

ddRT-PCR 

This was carried out using the PCR-Select™ cDNA 
Subtraction Kit (Clontech, Palo Alto, CA, USA) ac- 
cording to the manufacturer's instructions. Final PCR 
reactions were run on a 2% Metaphor agarose (FMC, 
Rockland, MD, USA) gel containing ethidium bro- 
mide (Sigma, Dorset, UK) and then overstained for 30 
min with SYBR Green I DNA stain (FMC, 1:10 000 
dilution in TAE). 



Band extraction and cloning 

Each discernible band from the differential display 
pattern was extracted from the gel with a scalpel and 



the DNA eluted using a Genelute™ Agarose Spin Col- 
umn (Supelco, Bellefonte). An aliquot of the eluted 
DNA (5 µl) was re-amplified using the original ddRT-PCR 
nested primers and electrophoresed on a 2% 
agarose gel. The re-amplified band was extracted from 
the gel (as above) and the eluted DNA ligated directly 
into the TOPO TA Cloning® vector (Invitrogen, 
Carlsbad) before transformation in Escherichia coli 
TOP10F' One Shot™ cells (Invitrogen). 

Stage I screening 

Twelve transformed (white) colonies from each band 
were grown up for 6 h in 200 µl LB broth containing 
ampicillin (Sigma, 50 µg/ml) and 1 µl of this amplified 
by PCR (as specified in the ddRT-PCR technical 
manual). One quarter of the completed reaction 
was electrophoresed on a standard 2% agarose gel and 
one quarter on a 2% agarose gel containing HA Yel- 
low (Hanse Analytik GmbH, Bremen, Germany, 1 
U/µl) to discern the different cloning products. The remainder 
was used to prepare duplicate dotblots on Hybond 
N+ (nylon) membranes (Amersham, Little Chalfont, 
UK). Cultures containing different cloning products 
were grown up and a plasmid miniprep prepared 
from each (Wizard Plus SV Minipreps DNA Purifica- 
tion System, Promega) according to the manufac- 
turer's instructions. 



Stage II screening 

The duplicate dotblots were probed with: (a) the final 
differential display reaction; and (b) the 'reverse-subtracted' 
differential display reaction. To make the 
'reverse-subtracted' probe, the subtractive hybridisation 
step of the ddRT-PCR procedure was carried out using 
the original tester cDNA as a driver and the driver as 
a tester. Probing and visualisation were carried out us- 
ing the ECL Direct Nucleic Acid Labelling and Detec- 
tion System (Amersham) according to the manufac- 
turer's instructions. Those clones which were positive 
for (a) but negative for (b), or showed a substantially 
larger positive signal with (a) compared to (b), were 
chosen for further analysis. 
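
The selection rule above can be written down as a short decision function. This is only a sketch: the paper gives no numeric cut-offs, so the detection threshold, the fold threshold used for "substantially larger", and the example signal values are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical cut-offs: the text gives no numeric thresholds.
DETECTION_THRESHOLD = 0.1   # a signal above this counts as "positive"
FOLD_THRESHOLD = 3.0        # "substantially larger" taken here as >= 3-fold


@dataclass
class DotBlotSignal:
    clone: str
    forward: float   # signal with probe (a), the final differential display
    reverse: float   # signal with probe (b), the reverse-subtracted display


def is_candidate(sig, detect=DETECTION_THRESHOLD, fold=FOLD_THRESHOLD):
    """Return True if the clone would be kept for further analysis."""
    if sig.forward <= detect:
        return False                 # negative with (a): rejected
    if sig.reverse <= detect:
        return True                  # positive for (a), negative for (b)
    # Positive with both probes: keep only if (a) clearly exceeds (b).
    return sig.forward / sig.reverse >= fold


# Illustrative (invented) densitometry values for four clones.
signals = [
    DotBlotSignal("clone-1", 0.90, 0.02),
    DotBlotSignal("clone-2", 0.80, 0.70),
    DotBlotSignal("clone-3", 0.05, 0.03),
    DotBlotSignal("clone-4", 0.90, 0.20),
]
selected = [s.clone for s in signals if is_candidate(s)]
```

With these invented values, clones positive with both probes at similar intensity, or negative with both, are treated as false positives.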



DNA sequencing 

Positive clones as identified above were sequenced on 
an automated ABI DNA sequencer (Applied Biosys- 
tems, Warrington, UK). 
Fig. 1 : (A) Subtractive ddRT-PCR patterns obtained from rat liver following 3-day treatment with Wy-14,643 or phenobarbital. Lane 
1, 1 kb ladder; lane 2, genes up-regulated following Wy-14,643 treatment; lane 3, genes down-regulated following 
Wy-14,643 treatment; lane 4, genes up-regulated following phenobarbital treatment; lane 5, genes down-regulated following 
phenobarbital treatment; and lane 6, 1 kb ladder. (B) Subtractive ddRT-PCR patterns obtained from rat liver showing relative 
changes when phenobarbital-treated mRNA is subtracted from Wy-14,643-treated mRNA and vice versa. Lane 1, 1 kb 
ladder; lane 2, genes showing increased expression following Wy-14,643 treatment compared to phenobarbital treatment; 
lane 3, genes showing increased expression following phenobarbital treatment compared to Wy-14,643 treatment. See 
Materials and Methods for further details. 




Fig. 2 : Re-amplified ddRT-PCR products which were down-regulated following phenobarbital treatment (upregulated bands were also 
re-amplified but gel not shown). Individual DNA bands excised from gel of ddRT-PCR reactions were extracted, 
re-amplified and run on agarose gels to confirm amplification of correct band (numbered). See Materials and Methods for 
further details. 
Table I : Rat liver genes down-regulated by phenobarbital treatment 

Band number (Fig. 2)                 Highest sequence
(approximate size in bp)   Clone     homology           FASTA-EMBL gene identification

1 (1500)                             95.3%              Rat mRNA for 3-oxoacyl-CoA thiolase
2 (1200)                             92.3%              Rat hemopexin mRNA
3 (1000)                             91.7%              R. rattus alpha-2u-globulin mRNA
7 (700)                    Clone 1   77.2%              M. musculus mRNA for C1 inhibitor
                           Clone 2   94.5%              Rat electron transfer flavoprotein
                           Clone 3   91.0%              Mouse topoisomerase 1 (Topo 1) mRNA
8 (650)                    Clone 1   86.9%              Soares 2NbMT M. musculus (EST)
                           Clone 2   96.2%              Rat alpha-2u-globulin (s-type) mRNA
9 (600)                    Clone 1   86.9%              Soares mouse NML M. musculus (EST)
                           Clone 2   82.0%              Soares p3NMF19.5 M. musculus (EST)
10 (550)                             73.8%              Soares mouse NML M. musculus (EST)
11 (illegible)                       95.7%              NCI_CGAP_Pr1 H. sapiens (EST)
12 (375)                             100.0%             R. norvegicus mRNA for ribosomal protein
13 (230)                   Clone 1   97.2%              Soares mouse embryo NbME135 (EST)
                           Clone 2   100.0%             Rat fibrinogen B-beta-chain
                           Clone 3   100.0%             Rat apolipoprotein E gene
14 (170)                             96.0%              Soares p3NMF19.5 M. musculus (EST)
15 (140)                             97.3%              Stratagene mouse testis (EST)
Others: (300)                        96.7%              R. norvegicus RASP 1 mRNA
        (275)                        93.1%              Soares mouse mammary gland (EST)



EST = expressed sequence tag. 
Bands 4-6 were shown to be false positives by dotblot analysis and, therefore, not sequenced. 



Table II : Rat liver genes up-regulated by phenobarbital treatment 

Band number                          Highest sequence
(approximate size in bp)   Clone     homology           FASTA-EMBL gene identification

5 (1300)                             93.5%              Rat cytochrome P450IIB1
7 (1000)                             95.1%              mRNA for rat preproalbumin;
                                                        Rat serum albumin mRNA
8 (950)                              98.3%              NCI_CGAP_Pr1 H. sapiens (EST)
10 (850)                             95.7%              Rat cytochrome P450IIB1
11 (800)                   Clone 1   94.9%              Rat cytochrome P450IIB1
                           Clone 2   75.3%              Rat cytochrome p450-L (p450IIB2)
12 (750)                             93.8%              Rat TRPM-2 mRNA;
                                                        Rat mRNA for sulfated glycoprotein
15 (600)                             92.9%              mRNA for rat preproalbumin;
                                                        Rat serum albumin mRNA
16 (550)                   Clone 1   95.2%              Rat cytochrome P450IIB1
                           Clone 2   93.6%              Rat haptoglobin mRNA partial alpha
21 (350)                             99.3%              R. norvegicus genes for 18S, 5.8S & 28S rRNA



EST = expressed sequence tag. 
Bands 1-4, 6, 9, 13, 14 and 17-20 shown to be false positives by dotblot analysis and, therefore, not sequenced. 
Identification of differentially-regulated genes 

Gene sequences were identified using the FASTA programme 
(http://www.ebi.ac.uk/htbin/fasta.py?request) 
to search all EMBL databases for matching DNA sequences. 

RESULTS 

Figure 1A,B shows the ddRT-PCR patterns of genes 
showing altered expression in rat liver following 3 day 
treatment with phenobarbital or Wy-14,643. Individual 
bands were isolated from the phenobarbital-modulated 
patterns (both up- and down-regulated), re-amplified 
(Fig. 2), cloned, screened for false positives and then 
identified. Those xenobiotic-modulated gene products 
identified to date are listed in Tables I and II. 



DISCUSSION 

The advent of combinatorial chemistry has led to the 
synthesis of millions of new chemical compounds, 
many of which may be useful in pharmaceutical, 
agricultural or industrial applications. However, 
whilst there are tests available for those possessing 
genotoxic activity, there remains no short-term assay 
able to identify those chemicals which may belong to 
the non-genotoxic group of carcinogens. 

We have used an adaptation of the subtractive hy- 
bridisation method - ddRT-PCR - to produce charac- 
teristic profiles or 'fingerprints' of those genes which 
are up-regulated or down-regulated in male rat liver 
following acute exposure to test chemicals. The ddRT- 
PCR profiles are characteristic and unique for each of 
the 2 compounds studied to date. 

A number of those gene species showing altered 
expression following phenobarbital treatment have 
been cloned and identified (Tables I & II). It is inter- 
esting to note the presence of CYP2B2 in the up-regu- 
lated genes. This would, of course, be expected fol- 
lowing exposure to phenobarbital and serves as a posi- 
tive control for the method. Other genes which one 
might normally expect to be up-regulated do not ap- 
pear in the table. However, it should be noted that not 



all bands seen on the differential display were ex- 
tracted and re-amplified due to their being too faint or 
too close to other bands to accurately excise. Further- 
more, it has been well documented [(5) and references 
therein] that a single band extracted from a differential 
display often represents a composite of heterogeneous 
products. We are currently examining new methods to: 
(i) improve resolution of the differential display pat- 
terns (including 2-D agarose gels); and (ii) distinguish 
those ddRT-PCR products which are identical in size, 
but different in sequence. 

Our future efforts will be directed towards deter- 
mining the extent of modulation of a number of the 
genes reported herein using semi-quantitative RT- 
PCR. This should reveal the extent of changes in ex- 
pression of key gene products which may be involved 
in non-genotoxic hepatocarcinogenesis and thus help 
increase understanding of this process. Furthermore, it 
is anticipated that aligning ddRT-PCR profiles of dif- 
ferent non-genotoxic agents found in responsive and 
non-responsive species may enable identification of 
those genes which are mechanistically relevant to the 
non-genotoxic hepatocarcinogenic process. Accord- 
ingly, this approach lends itself well to the identifica- 
tion, characterisation and sub-classification of possible 
different classes of non-genotoxic carcinogens. 
Abstract 

Understanding the genetic profile of a cell at all stages of normal and carcinogenic development should provide an 
essential aid to developing new strategies for the prevention, early detection, diagnosis and treatment of cancers. We 
have attempted to identify some of the genes that may be involved in peroxisome-proliferator (PP)-induced 
non-genotoxic hepatocarcinogenesis using suppression PCR subtractive hybridisation (SSH). Wistar rats (male) were 
chosen as a representative susceptible species and Duncan-Hartley guinea pigs (male) as a resistant species to the 
hepatocarcinogenic effects of the PP, [4-chloro-6-(2,3-xylidino)-2-pyrimidinylthio] acetic acid (Wy-14,643). In each 
case, groups of four test animals were administered a single dose of Wy-14,643 (250 mg/kg per day in corn oil) by 
gastric intubation for 3 consecutive days. The control animals received corn oil only. On the fourth day the animals 
were killed and liver mRNA extracted. SSH was carried out using mRNA extracted from the rat and guinea pig 
livers, and used to isolate genes that were up and downregulated following Wy-14,643 treatment. These genes 
included some predictable (and hence positive control) species such as CYP4A1 and CYP2C11 (upregulated and 
downregulated in rat liver, respectively). Several genes that may be implicated in hepatocarcinogenesis have also been 
identified, as have some unidentified species. This work thus provides a starting point for developing a molecular 
profile of the early effects of a non-genotoxic carcinogen in sensitive and resistant species that could ultimately lead 
to a short-term assay for this type of toxicity. © 2000 Elsevier Science Ireland Ltd. All rights reserved. 
RT-PCR; Rat; Guinea pig; Gene regulation; Differential gene display; Gene profiling 
1. Introduction 

The advent of combinatorial chemistry and 
computer-aided drug design has led to a recent 
upsurge in the number of chemical compounds 
that have potential therapeutic, agricultural and 
industrial applications. Although it has been suggested 
that the contribution of synthetic chemicals 
to the overall incidence of human cancer is low, 
there still remains an absolute requirement to 
evaluate all new chemicals for toxic and carcinogenic 
potential. The latter is one of the most 
problematic areas of chemical safety evaluation 
and is usually carried out using short-term in vitro 
and in vivo genotoxicity assays augmented by 
chronic bioassay tests. The short-term assays have 
proved useful in the early identification of potential 
genotoxic carcinogens, but their value is limited 
by observations that suggest that 
approximately 60% of chemicals identified as carcinogens 
in life-exposure studies produce mainly 
negative findings in short-term genotoxicity tests 
(Ashby, 1992; Parodi, 1992). Thus, there is currently 
no reliable and rapid means of evaluating 
the carcinogenic risk of new chemicals that fall 
into this latter group of compounds, termed non-genotoxic 
(or epigenetic) carcinogens. 

One approach to addressing this problem is to 
elucidate the molecular mechanisms by which 
known non-genotoxic carcinogens act. It should 
then be possible to identify common factors/ 
mechanisms that can serve as early biomarkers of 
carcinogenic potential for new chemicals. To this 
end, a large number of groups have reported on 
the various effects of non-genotoxic compounds 
in various animal species (Marsman et al., 1988; 
Lake et al., 1993; Cattley et al., 1994; Hayashi et 
al., 1994; Human and Experimental Toxicology, 
1994; Anderson et al., 1996). However, the mechanistic 
picture is still far from complete with many 
of those genes involved in the carcinogenic process 
remaining unknown, and their identification 
therefore remains a key goal in elucidating the 
molecular mechanisms by which non-genotoxic 
carcinogenesis occurs. 

Subtractive hybridisation (SH) and related technologies 
such as representational difference analysis 
(RDA) (Hubank and Schatz, 1994) and 



differential display (DD) (Liang and Pardee, 
1992) can be used to aid the isolation of genes 
showing altered expression in target tissues fol- 
lowing exposure to a chemical stimulus. These 
techniques can also be used to identify differential 
gene expression in neoplastic and normal cells 
(Liang et al., 1992), infected and normal cells 
(Duguid and Dinauer, 1990), differentiated and 
undifferentiated cells (Sargent and Dawid, 1983; 
Guimaraes et al., 1995), activated and dormant 
cells (Gurskaya et al., 1996; Wan et al., 1996), 
different cell types (Hedrick et al., 1984; Davis et 
al., 1984) amongst others. Most importantly, us- 
ing such approaches, no prior knowledge of the 
specific genes that are upregulated/downregulated 
is required. 

Using a variation of SH, termed suppression- 
PCR subtractive hybridisation (SSH) (Diatchenko 
et al., 1996), we have previously reported the 
isolation of a number of genes showing altered 
expression in male rat liver following acute expo- 
sure to phenobarbital (Rockett et al., 1997). In 
the current work we have used the same experi- 
mental approach to isolate genes that are differen- 
tially expressed in the livers of male rats and 
guinea pigs following short-term (3-day) exposure 
to the peroxisome proliferator (PP) and non-genotoxic 
hepatocarcinogen, Wy-14,643. We have 
isolated and identified a number of gene species, 
some of which may be important in the induction 
of, or protection against, non-genotoxic 
hepatocarcinogenesis. 

2. Materials and methods 

2.1. Animals and treatment 

All animal experiments were undertaken in ac- 
cordance with Her Majesty's Home Office De- 
partment guidelines under the auspices of 
approved personal and project licences. Male 
Wistar rats (150-200 g) and male Duncan-Hart- 
ley guinea pigs (250-300 g) were obtained from 
Kingman and Bantam (Hull, UK). Upon receipt, 
animals of each species were randomly assigned to two 
groups of four. They were maintained on a rat, 
mouse or guinea pig standard diet (B&K Universal, 
Hull) and a daily cycle of alternating 12-h 
periods of dark and light. The room temperature 
was maintained at 19°C and the relative humidity at 
55%. The animals were acclimatised to this environment 
for 7 days before treatment commenced. 
[4-chloro-6-(2,3-xylidino)-2-pyrimidinylthio] acetic 
acid (Wy-14,643, Campo, Emmerich; 250 mg/kg 
per day in corn oil) was administered by gavage 
to the treated groups of rats and guinea pigs on 3 
consecutive days, whilst control groups received 
an equal volume of corn oil only. During this 
time, all animals had free access to food and 
water. The animals were killed by cervical disloca- 
tion on the fourth day, and their livers immedi- 
ately excised, weighed, sliced into approximately 
0.5-cm cubes, snap frozen in liquid nitrogen and 
stored at -70°C. 

2.2. mRNA extraction 

Approximately 0.25 g of each frozen liver sam- 
ple was ground under liquid nitrogen using a 
mortar and pestle. Messenger RNA was extracted 
from the ground liver using the PolyATtract® 
System 1000 kit (Promega, Madison, USA) ac- 
cording to the technical manual provided by the 
manufacturers. The mRNA was DNase- treated 
(RQ Rnase-free Dnase, Promega, final concentra- 
tion 10 U/ml) before phenol/chloroform extrac- 
tion and ethanol precipitation. The mRNA was 
redissolved at a final concentration 500-1000 ng/ 
ill 

2.3. cDNA Subtraction 

This was carried out using the PCR-Select™ 
cDNA Subtraction Kit (Clontech, Palo Alto, 
USA) according to the manufacturer's instruc- 
tions. Subtractions were carried out with mRNAs 
derived from single animals. The mRNA from the 
remaining three animals in each group was later 
used for quantitative RT-PCR analysis of specific 
genes. 

2.4. Band extraction and cloning 

The secondary PCR reactions from the cDNA 
subtraction procedure were run on a 2% 



Metaphor agarose gel (FMC, Rockland, USA) 
containing 0.5 µg/ml ethidium bromide (Sigma, 
Dorset, UK). One times TAE (0.04 M Tris-ac- 
etate, 0.001 M EDTA) was used to prepare the gel 
and as the running buffer. After running for 6-7 
h at 3.75 V/cm, the gel was overstained for 30 min 
with SYBR Green I DNA stain (FMC, 1:10000 
dilution in 1 x TAE). Each discernible band of 
the differential display pattern was extracted from 
the gel with a scalpel and the DNA eluted using a 
Genelute™ agarose spin column (Supelco, Belle- 
fonte, USA). Five microlitres of the eluted DNA 
was reamplified using the original nested (sec- 
ondary) PCR primers supplied with the PCR-Se- 
lect™ cDNA subtraction kit. The PCR products 
were electrophoresed on a 2% standard agarose 
gel (Boehringer Mannheim, East Sussex, UK) and 
the reamplified target bands extracted from the 
gel as above. The eluted DNA was immediately 
ligated into a TOPO TA Cloning® vector (Invitro- 
gen, Carlsbad, USA) before transformation in 
Escherichia coli TOP10F' One Shot™ cells 
(Invitrogen). 

2.5. Colony screening 

2.5.1. Stage I 

Eight transformed (white) colonies from each 
band were grown up for 6 h in 200 µl LB broth 
containing ampicillin (Sigma, 50 µg/ml). One microlitre 
of this was subjected to PCR using the 
same conditions and nested primers as described 
above. One tenth (2 µl) of the completed PCR 
reaction was electrophoresed on a 2% standard 
agarose gel and one tenth on a 2% standard 
agarose gel containing HA red (Hanse Analytik 
GmbH, Bremen, Germany, 1 U/ml) to discern the 
differentially cloned products. The remainder of 
the PCR reaction was used to prepare duplicate 
dotblots on Hybond N+ membranes (Amersham, 
Little Chalfont, UK). 

2.5.2. Stage II 

The duplicate dotblots were probed with (a) the 
final differential display reaction and (b) the 're- 
verse-subtracted' differential display reaction. To 
make the 'reverse-subtracted' probe, the subtrac- 
tive hybridisation step of the differential display 
RT-PCR procedure was carried out using the 
original tester (treated) mRNA as the driver and 
the original driver (control) mRNA as the tester. 
Probing and visualisation were carried out using 
the ECL direct nucleic acid labelling and detec- 
tion system (Amersham, Little Chalfont, UK) ac- 
cording to the manufacturer's instructions. Those 
clones that were positive for (a) but negative for 
(b), or showed a substantially larger positive sig- 
nal with (a) compared to (b), were selected for 
DNA sequence analysis. 

2.6. DNA sequencing 

The remainder of the cultures (prepared in 
stage I screening) containing different cloning 
products (as discerned in the two screening steps) 
were grown up overnight in 5 ml LB broth containing 
ampicillin (50 µg/ml). A plasmid miniprep 
was prepared from each (Wizard Plus SV 
Minipreps DNA purification system, Promega) 
according to the manufacturer's instructions. The 
cloned inserts were sequenced on an automated 
ABI DNA sequencer (Applied Biosystems, Warrington, 
UK) using the M13 forward primer 
(GTAAAACGACGGCCAGT) or M13 reverse 
primer (AACAGCTATGACCATG). 

2.7. Identification of differentially regulated genes 

Gene sequences thus obtained were identified 
using the FASTA 3.0 programme (Lipman and 
Pearson, 1985; Pearson and Lipman, 1988) 
(http://www.ddbj.nig.ac.jp/E-mail/homology.html) to 
search all EMBL databases for matching DNA 
sequences. Each clone sequence was submitted in 
the forward and reverse direction, and the one 
returning the highest statistical probability of 
match to a known sequence was noted. Sequence 
homologies between our submitted clone sequence 
and the queried database sequence were deter- 
mined (by FASTA) over a region of at least 60 
base pairs. 
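
The two-direction submission described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the hit records, their field names and the use of the lowest expectation value as "highest statistical probability of match" are hypothetical, and no real FASTA service is called.

```python
# Translation table for complementing DNA bases (N kept as N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def best_hit(hits, min_overlap=60):
    """Pick the most significant hit whose aligned region is >= min_overlap bp.

    `hits` is a list of dicts with the hypothetical keys
    'direction', 'description', 'identity' (%), 'overlap' (bp) and 'evalue'.
    """
    eligible = [h for h in hits if h["overlap"] >= min_overlap]
    if not eligible:
        return None
    return min(eligible, key=lambda h: h["evalue"])


# The M13 forward primer from section 2.6, used only as a test sequence.
m13_forward = "GTAAAACGACGGCCAGT"
m13_forward_rc = reverse_complement(m13_forward)

# Invented search results for one clone submitted in both orientations.
example_hits = [
    {"direction": "forward", "description": "hit A", "identity": 94.0,
     "overlap": 212, "evalue": 1e-40},
    {"direction": "reverse", "description": "hit B", "identity": 99.0,
     "overlap": 45, "evalue": 1e-55},   # too short: below 60 bp
    {"direction": "reverse", "description": "hit C", "identity": 88.0,
     "overlap": 150, "evalue": 1e-12},
]
chosen = best_hit(example_hits)
```

The 60 bp minimum is taken from the text; the short but highly significant hit is excluded because its aligned region falls below that length.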

2.8. RT-PCR analysis of selected candidate genes 

cDNA sequences of the target genes were ob- 
tained from the NIH gene database (GenBank at 



http://www.ncbi.nlm.nih.gov/Web/Search/index.html) 
and the computer programme Gene Jockey 
(BioSoft, Cambridge, UK) used to select 
primer pairs from these sequences. Where guinea 
pig sequences were available, rat and guinea pig 
sequences were aligned and primers chosen from 
regions of homology. If guinea pig sequences were 
not available, rat and human sequences were 
used. In cases where exact homology could not be 
found, the sequence from the rat was used. In the 
case of CD81 only, no rat or guinea pig sequences 
were available and so mouse and human se- 
quences were aligned and a primer pair chosen 
from a region of homology. Primers (obtained 
from Gibco-BRL, Paisley, UK) were dissolved at 
a concentration of 50 pmol/µl in sterile distilled 
water and stored at -20°C. The primer pairs 
used plus other reaction parameters are shown in 
Table 1. mRNA was extracted (as described 
above) from all four treated animals and from 
three animals in the control group. Integrity of 
the eluted mRNA was confirmed on a 2% agarose 
gel, and the concentration and purity were mea- 
sured using a Genequant II spectrophotometer 
(LKB, Bromma, Sweden) and then diluted to 10 
ng/µl. One microlitre of this latter solution was 
used per RT-PCR reaction. 

RT-PCR was carried out in a single tube (50 µl) 
reaction using the Access RT-PCR system 
(Promega) according to manufacturer's instruc- 
tions. In the kinetic and quantitative analyses, 
omission of RNA was used as a control for the 
presence of any contaminating DNA. After ob- 
taining a PCR signal of the correct size and 
optimising the reaction conditions, each PCR 
product was digested with between two and four 
separate restriction enzymes. Specific restriction 
patterns were thus obtained, which further confi- 
rmed the identity of the PCR products as being 
the original target genes. Kinetic analysis (14-32 
cycles) was then performed in each case to deter- 
mine the location of the mid-log phase. 

For the semi-quantitative analysis of each 
target gene, RT-PCR reactions were carried out in 
triplicate for each sample to reduce the effect of 
intertube RT-reaction variations (Kolls et al., 
1993) and pipetting errors. For each gene, a mas- 
termix containing enough reagents for three times 
the number of samples (seven for rat, six for 
guinea pig) was prepared except that mRNA was 
omitted, the latter being added after aliquoting 49 
µl of the mastermix into an appropriate number 
of tubes. Amplification of albumin (the reference 
gene) was carried out in separate tubes since the 
mid-log phase of this gene is at a much lower 
cycle number than the target genes due to its high 
abundance. All RT-PCR products were analysed 
on 2% agarose gels containing 0.5 µg/ml ethidium 
bromide. The target gene samples were loaded on 
the gel first and run in at 3 V/cm for 10 min. The 
corresponding albumin samples were then loaded 
and the gel run for a further 1/2 h. In this way, all 




Fig. 1. Final displays of differentially expressed genes that 
were (1) upregulated and (2) downregulated in rat (A) and 
guinea pig (B) livers following 3-day treatment with Wy-14,643. 
mRNA extracted from control and treated livers was 
used to generate the differential displays using the PCR-Select 
cDNA subtraction kit (Clontech). Lane (L) is a 1 Kb DNA 
Ladder standard and 10 µl of secondary PCR reaction were 
loaded in all other lanes. 



RT-PCR products from each target gene and 
albumin from the corresponding samples could be 
run on the same gel. Gels were photographed 
using type 665 posi-neg film (Sigma) and quanti- 
tation of the band intensity was carried out using 
a dual wavelength flying spot laser scanner densit- 
ometer (Shimadzu). 
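
The text does not state how the densitometer readings were combined. One common convention, assumed here purely for illustration, is to divide each target-gene band intensity by the albumin intensity from the same sample and average over the triplicate reactions. All numbers below are invented.

```python
from statistics import mean


def normalised_expression(target_intensities, albumin_intensities):
    """Mean target/albumin intensity ratio over replicate reactions.

    Assumes replicate i of the target gene is paired with replicate i of
    albumin from the same sample (an assumption, not stated in the paper).
    """
    if len(target_intensities) != len(albumin_intensities):
        raise ValueError("replicate counts must match")
    ratios = [t / a for t, a in zip(target_intensities, albumin_intensities)]
    return mean(ratios)


def fold_change(treated_values, control_values):
    """Ratio of mean normalised expression, treated over control."""
    return mean(treated_values) / mean(control_values)


# Invented triplicate band intensities (arbitrary densitometer units).
treated_animal = normalised_expression([200, 220, 180], [400, 400, 400])
control_animal = normalised_expression([100, 110, 90], [400, 400, 400])
change = fold_change([treated_animal], [control_animal])
```

Because albumin is amplified in separate tubes at a lower cycle number, the ratio is only meaningful if both reactions were stopped in their mid-log phase, as the kinetic analysis is intended to ensure.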

2.9. Statistical analysis 

Statistical analysis of unpaired samples was carried 
out using the two-tailed Student's t-test. Values 
were considered statistically significant at 
P < 0.05. 
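
For reference, the unpaired two-tailed Student's t-test can be sketched with the standard library alone. The sketch assumes equal variances (the classical pooled form) and obtains the P value by numerically integrating the t density; the group values are invented, chosen only to mirror the four treated and three control animals used for RT-PCR.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=10000):
    """P(|T| >= |t|), using Simpson's rule on the density from 0 to |t|."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3                # area between 0 and |t|
    return max(0.0, 1 - 2 * central)


def student_t_test(a, b):
    """Unpaired Student's t-test with pooled variance; returns (t, df, p)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df, two_tailed_p(t, df)


# Invented normalised expression values: three control and four treated animals.
control = [1.0, 1.2, 0.8]
treated = [2.0, 2.2, 1.8, 2.0]
t_stat, dof, p_value = student_t_test(treated, control)
significant = p_value < 0.05
```

With groups this small the test has 5 degrees of freedom, so |t| must exceed roughly 2.57 to reach P < 0.05.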



3. Results 

3.1. Cloning and screening of transcripts 

For both the rat and guinea pig experimental 
groups, cDNA subtraction was carried out in the 
forward (control driving tester) and reverse (tester 
driving control) directions to isolate both upregu- 
lated and downregulated mRNA species respec- 
tively. Using a standard primary hybridisation 
time of 8 h we obtained a substantial amount of 
non-specific products in all the final differential 
displays (data not shown). This background 
smearing was almost completely removed by re- 
ducing the primary hybridisation time to 4 h 
(CLONTECHniques, 1996). Fig. 1 shows the 
ddRT-PCR patterns of genes showing altered ex- 
pression in rat and guinea pig liver following 
3-day treatment with Wy-14,643. The profiles are 
unique for each species, and in each case the 
profile for the upregulated genes (control mRNA 
driving tester mRNA) is different to that obtained 
for the downregulated genes (tester mRNA driv- 
ing control mRNA). 

The practical outcome of the SSH method is 
that a series of differentially expressed genes is 
observed as a ladder on an agarose gel. The 
majority of these gene fragments fall within the 
150-2000 bp range, with bands up to 5 Kbp 
occasionally being observed. Each band may the- 
oretically consist of one or more products of 
similar size, as the gel has a maximum resolution 
Fig. 2. Discrimination of different ddRT-PCR products having 
the same molecular size using HA-red. Gel (A) is a 2% 
standard agarose gel. Gel (B) is a 2% standard agarose gel 
containing 1 U/ml HA-red. Band numbers refer to the sequential 
bands (largest to smallest) extracted from the original 
display of genes upregulated in rat liver following 3-day treatment 
with Wy-14,643. Ten microlitres of each PCR reaction 
were loaded per lane. 



of approximately 1.5% (3 bp per 200). In addi- 
tion, there may be two or more products that are 
the same size, but have a different sequence. 



Therefore some form of discrimination must be 
employed to isolate as many of these products as 
possible. HA-red screening (Geisinger et al., 1997) 
of a number of clones derived from each band 
provided a means to discriminate between differ- 
ent gene species of the same size. A typical exam- 
ple of such a gel is shown in Fig. 2. In total, 88 
and 48 apparently different clones were obtained 
from the final differential expression patterns of 
upregulated and downregulated rat genes, respec- 
tively. Sixty nine and 89 apparently different 
clones were obtained from the final differential 
expression patterns of the upregulated and down- 
regulated guinea pig genes, respectively. 

Having identified as many different candidate 
gene products as possible in screening step I, a 
second screening step was carried out on every 
clone to confirm those that represented true dif- 
ferentially expressed genes. This is necessary since 
no subtraction technique is 100% efficient. The 
approach we used, termed PCR-select differential 
screening (as recommended in Clontech's PCR-se- 
lect cDNA subtraction kit protocol), utilises the 
forward and reverse subtractions as an aid to 
screening for the true differentially expressed 
genes (CLONTECHniques, 1997). Because these 
probes have already undergone subtraction, they 
have been enriched for differentially expressed 
genes and are therefore more sensitive than un- 
subtracted driver/tester cDNA probes for detect- 
ing true differential expression. All the clones that 
were isolated from each display were dotblotted 
and probed with the display from which they were 
obtained, plus the corresponding reverse-sub- 
tracted display. An example of such a blot is 
shown in Fig. 3. Clones corresponding to authen- 
tic differentially expressed mRNAs hybridised 
with the subtracted cDNA probe, but not the 
reverse-subtracted probe. We also included in the 
authentic positives, those clones that gave a sub- 
stantially greater signal with the subtracted probe 
compared to the reverse-subtracted probe. False 
positives hybridised with either both probes or 
with neither probe. Of the original 88 upregulated 
and 48 downregulated rat clones selected for this 
screening step, 28 (32%) and 15 (31%) respec- 
tively, were found to be true positives. In the rat, 



J.C. Rockett et al. / Toxicology 144 (2000) 13-29 



28 (100%) of the true positive upregulated genes
(Table 2) and 11 (73%) of the true positive downregulated
genes (Table 3) were non-redundant. Of
the original 69 upregulated and 89 downregulated
guinea pig clones selected for this screening step,
48 (70%) and 37 (42%) respectively, were found to
be true positives. Thirty six (75%) of the upregulated
genes (Table 4) and 33 (89%) of the downregulated
genes (Table 5) were non-redundant.
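
The percentages quoted for this screening step can be rechecked from the clone counts. The following is a minimal sketch only; the count of 48 true positive upregulated guinea pig clones is an assumption inferred from the reported 70% of 69 clones, and the function names are illustrative.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)

# (clones screened, true positives, non-redundant true positives) from the text.
# The guinea pig "up" true-positive count (48) is inferred, not read directly.
screen = {
    "rat up": (88, 28, 28),
    "rat down": (48, 15, 11),
    "guinea pig up": (69, 48, 36),
    "guinea pig down": (89, 37, 33),
}

def summarise(counts):
    """Return (% true positives of screened, % non-redundant of true positives)."""
    screened, true_pos, unique = counts
    return percent(true_pos, screened), percent(unique, true_pos)

for name, counts in screen.items():
    tp_pct, nr_pct = summarise(counts)
    print(f"{name}: {tp_pct}% true positives, {nr_pct}% non-redundant")
```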

3.2. Identification of clones

On sequence analysis it was found that some
clones were unsequenceable in the first instance
(M13 forward primer) due to long polyA runs
that appeared to prematurely terminate the sequencing
reaction. These clones were therefore
resequenced from the opposite direction using the
M13 reverse primer. Those xenobiotic-modulated
gene products identified to date are listed in
Tables 2 and 3 (rat) and Tables 4 and 5 (guinea pig).




Fig. 3. Dot blots of clones of putative upregulated gene species
isolated from guinea pig liver following 3-day treatment with
Wy-14,643. All clones identified in the stage I screening step
(see methods) were blotted and probed with (A) the differential
display from which they originated (control driving
treated) and (B) the reverse subtraction (treated driving control).
Arrows indicate some of the true differentially expressed
clones.



Table 2

Identification of genes that were upregulated in male rat liver
following 3-day treatment with Wy-14,643

FASTA-EMBL gene identification                              Accession No. Sequence homology a (%)
(rat unless otherwise stated)

Carnitine octanoyl transferase                              RN26033       99
NCI_CGAP_Li1 (H. sapiens) (EST b)                           HS1275949     98
Peroxisomal enoyl hydratase-like protein                    RN08976       98
Liver fatty acid binding protein                            V01235        96
Soares mouse p3NMF19.5 M. musculus cDNA clone               AA038051      96
Cytochrome P450IVA1                                         RNCYPLA       94
Mit. 3-hydroxy-3-methylglutaryl CoA synthase                RNHMGCOA      94
Rab geranylgeranyl transferase component B                  RNRABGERA     94
Genes for 18S, 5.8S, and 28S ribosomal RNAs                 RNRRNA        94
Carnitine acetyl transferase (mouse)                        MMRNACAR      92
Soares mouse NML (EST)                                      MM1157113     92
Bone marrow stromal fibroblast (H. sapiens) cDNA clone      AA545726      92
  HBMSF2E4 (EST)
7.5dpc embryo (mouse) (EST)                                 AA408192      92
Alpha-1-macroglobulin                                       RNALPH1M      91
Transferrin                                                 RNTRANSA      91
Lecithin:cholesterol acyltransferase                        RNU62803      90
Zn-α2-glycoprotein                                          RNZA2GA       90
Serum albumin                                               RNJALBM       89
Fructose-1,6-bisphosphate 1-phosphohydrolase                RNFBP         88
Soares mouse melanoma (EST) (S c)                           AA124706      88
Soares mouse 3NbMS (EST) (AS c)                             AA154039      88
Table 2 (Continued)

FASTA-EMBL gene identification                              Accession No. Sequence homology a (%)
(rat unless otherwise stated)

17-β-hydroxysteroid dehydrogenase                           RN17BHDT2     87
Soares mouse p3NMF19.5 (EST)                                AA038051      87
Peroxisomal enoyl-CoA:hydratase-3-hydroxyacyl CoA           RNPECOA       85
  bifunctional enzyme
Integral membrane protein, TAPA-1 (CD81) (mouse)            S45012        81
Soares mouse lymph node (EST)                               MMAA88445     81
H. sapiens (clone zap 128) mRNA                             L40401        76
Lysophospholipase homologue (human)                         HSU67963      76
Soares mouse lymph node (EST)                               AA217044      74

a Refers to the nucleotide sequence homology between the
cloned band isolated from the differential display and the
corresponding gene derived from the EMBL gene sequence
bank.

b EST is 'expressed sequence tag', a gene of as yet
unknown identity and function.

c Where sequence homologies were equal in both directions
of the isolated band, both the sense (S) and antisense (AS)
identities are given.

Table 3

Identification of genes that were downregulated in male rat
liver following 3-day treatment with Wy-14,643

FASTA-EMBL gene identification                              Accession No. Sequence homology a (%)
(rat unless otherwise stated)

NCI_CGAP_Li1 (H. sapiens) (EST b) (S c)                     AA484528      99
NCI_CGAP_Pr1 (H. sapiens) (EST) (AS c)                      AA469320      99
UDP-glucuronosyltransferase (UGT2B12)                       RN06273       98
Complement component C3                                     RNC3          96
Soares mouse placenta (S)                                   AA023305      96
Ape (chimpanzee) 28S rRNA (AS)                              PTRGMC        96
Rat CYP2C11                                                 RNCYPM1       95
Ribosomal protein S5                                        RNRPS5        94
Transthyretin                                               RNTTHY        94
Contrapsin-like protease inhibitor                          RNCCP23       89
Prostaglandin F2α (S)                                       RN26663       84
β-2-microglobulin (AS)                                      RNB2MR        84
Apolipoprotein C-III                                        RNAPOA02      82
Parathymosin-alpha (zinc2+-binding protein)                 RN11ZNBP      75

a Refers to the nucleotide sequence homology between the
cloned band isolated from the differential display and the
corresponding gene derived from the EMBL gene sequence
bank.

b EST is 'expressed sequence tag', a gene of as yet
unknown identity and function.

c Where sequence homologies were equal in both directions,
both the sense (S) and antisense (AS) identities are given.



In all cases, both the forward and reverse se- 
quence of the target clones were analysed and the 
gene having the highest statistical homology 
noted. 

3.3. RT-PCR analysis of selected clones 

The results of a typical RT-PCR semi-quantitation
experiment for transferrin in the rat are given
in Fig. 4 and the results for a total of 12 selected
genes in both the rat and guinea pig are shown in
Table 6.



4. Discussion 

It is now apparent that all cancers arise from 
accumulated genetic changes within the cell. Al- 
though documenting and explaining these changes 
presents a formidable obstacle to understanding 
the different mechanisms of carcinogenesis, the 
experimental methodology is now available to 
begin attempting this difficult challenge. In order 
to begin the elucidation of the molecular mechanisms
involved in non-genotoxic hepatocarcinogenesis,
we have used SSH to identify a number of
genes that are upregulated or downregulated in
male rat and guinea pig livers following short
term exposure to the PP, Wy-14,643. We have
used the rat model to represent a species susceptible
to the non-genotoxic carcinogenic effect of
PPs and the guinea pig as a resistant species
(Orton et al., 1984; Rodricks and Turnbull, 1987;
Lake et al., 1989; Makowska et al., 1992; Lake et
al., 1993).

Gurskaya et al. (1996), who originally devel- 
oped the SSH technique, cloned the products of 
the secondary PCR reaction and screened a small 
number of randomly selected colonies for differ- 
entially expressed clones using northern hybridisa- 
tion. However, we decided against this approach 



Table 4

Identification of genes that were upregulated in male guinea pig liver following 3-day treatment with Wy-14,643

FASTA-EMBL gene identification                              Accession No. Sequence homology a (%)
(guinea pig unless otherwise stated)

Carboxylesterase                                            AB010634      97
Complement C3 protein [illegible]                           M34054        97
Cytosolic aldehyde dehydrogenase (sheep)                    U12761        92
Catalase (human)                                            X04076        89
Mitochondrial aspartate aminotransferase                    M11732        89
Elongation factor-1-alpha (rabbit)                          X62245        88
NCI_CGAP_Br2 H. sapiens cDNA clone (EST)                    AA587436      87
  (similar to chick mit. phosphoenolpyruvate carboxykinase)
Alpha-1-antiproteinase S                                    M57270        83
10-formyltetrahydrofolate dehydrogenase (rat)               M59861        83
Ribosomal protein L6 (rat)                                  X87107        83
Soares pregnant uterus Nb (EST) (mouse)                     AA156847      83
Mitochondrial citrate transport protein (human)             L77567        80
Cytoplasmic chaperonin hTRiC5 (human)                       U17104        80
Alpha-1-antiproteinase F                                    M57271        77
Heterogeneous nuclear ribonucleoprotein C1/C2 (human)       D28382        77
Soares parathyroid tumour (EST)                             AA860651      76
  (similar to human serum albumin precursor)
Stratagene mouse kidney (EST)                               AA107327      75
Soares parathyroid tumour NbHPA human cDNA (EST)            AA860653      74
Soares mouse mammary gland (EST)                            AA619297      74
cDNA clone 15 004 (EST) (human)                             H01826        74
Soares senescent fibroblasts (EST) (mouse)                  W52190        74
Preproalbumin (human)                                       E04315        72
cDNA clone 73 169 (EST) (human)                             T56624        72
Vitamin D-binding protein (human)                           L10641        71
ApoH gene (exon 8) (human)                                  Y11498        71
ISRL flow sorted chromosome                                 B05457        71
Soares foetal liver spleen (EST) (mouse)                    AA009524      71
Soares foetal heart NbMH19W (EST) (mouse)                   AA009421      69
Soares foetal heart NbHH19W H. sapiens cDNA clone (EST)     W94377        67
Phenylalanine hydroxylase (human)                           U49897        67
Pyrroline-5-carboxylate dehydrogenase (human)               U24266        66
Glutathione-S-transferase homologue (human)                 U90313        65
NCI_CGAP_GCB1 (EST) (human)                                 AA769294      65
Protective protein (human)                                  M22960        64
Clone 27 375 (EST) (human)                                  N37046        62
Stratagene colon (#937204) H. sapiens cDNA clone (EST)      AA149777      62

a Refers to the nucleotide sequence homology between the cloned band isolated from the differential display and the corresponding
gene derived from the EMBL gene sequence bank.
Table 5

Identification of genes that were downregulated in male guinea
pig liver following 3-day treatment with Wy-14,643

FASTA-EMBL gene identification                              Accession No. Sequence homology a (%)
(guinea pig unless otherwise stated)

Complement C3                                               M34054        97
Murinoglobulin                                              D84339        95
Alpha-1-antiproteinase F                                    M57271        88
Elongation factor alpha-1 (rabbit)                          X62245        89
Coupling protein [illegible]                                X04409        88
NCI_CGAP_[illegible] (EST) (human)                          AA586309      87
Lecithin:cholesterol acetyl transferase (rabbit)            D13668        85
Aldolase B (human)                                          X00270        84
Anti-thrombin III (human)                                   E00116        80
Phenylalanine hydroxylase (human)                           K03020        80
Inter-α-trypsin inhibitor                                   D38595        79
Normalised rat mus[illegible]                               AA849753      78
Normalised rat ovary                                        AA801059      78
Complement factor Ba fragment (human)                       X00284        77
Dihydrodiol dehydrogenase (human)                           U05598        76
Spot 14 gene (thyroid-inducible hepatic protein) (human)    Y08409        75
RAP clone 174o12 (human)                                    AC004236      75
Mitochondrial aldehyde dehydrogenase (human)                X05409        74
Preproalbumin (human)                                       E04315        74
NCI_CGAP_Pr9 (EST) (human) (S)                              AA533142      74
Normalised rat placenta (EST) (AS)                          AA851197      74
Heparin sulfate proteoglycan (human)                        J04621        73
cDNA clone 33 992 (EST) (human)                             R24330        73

Table 5 (Continued)

FASTA-EMBL gene identification                              Accession No. Sequence homology a (%)
(guinea pig unless otherwise stated)

Retinol dehydrogenase (rat)                                 U33501        71
TAPA-1 integral membrane protein (CD81) (mouse)             S45012        71
Complement component c5s                                    M35525        70
Apolipoprotein B (pig)                                      L11235        69
cDNA clone 143 918 (EST) (human)                            R76742        68
α-fibrinogen (human)                                        K02569        68
Soares foetal liver spleen 1NF (mouse)                      W03726        68
Barstead bowel (EST) (mouse)                                AA232049      67
UDP glucuronosyl transferase                                AF0309137     66
Myeloid leukaemia cell differentiation protein (MCL-1)      L08246        65
  (human) (S)
[illegible] (human) (AS)                                    G27984        65
Soares mouse 3NME125                                        AA222798      64
Stratagene mouse embryonic (EST) (S)                        AA199420      64
Rad 52 (mouse)                                              AF004854      63

a Refers to the nucleotide sequence homology between the
cloned band isolated from the differential display and the
corresponding gene derived from the EMBL gene sequence
bank.

b EST is 'expressed sequence tag', a gene of as yet
unknown identity and function.

c Where sequence homologies were equal in both directions,
both the sense (S) and antisense (AS) identities are given.

for several reasons: (1) the kinetics of ligation and 
transformation favour the isolation of smaller 
PCR products, thereby producing a misrepresen- 
tation of larger gene products; (2) northern blot 
analysis is notoriously insensitive and is unlikely 
to confirm expression of rare transcripts; (3) there 
is no measurable end point to the screening of 
clones produced in this way other than to analyse 
every transformed colony. We used instead an 
alternative approach; after running out the differential
display on a high-resolution agarose gel
(Fig. 1) and overstaining with SYBR Green I to
enhance visualisation, the composite bands were
individually extracted, reamplified and cloned.
However, it has been well documented that single
bands from differential displays often contain a
heterogeneous mixture of different products
(Mathieu-Daude et al., 1996; Smith et al., 1997).
This is because polyacrylamide gels cannot dis- 
criminate between DNA sequences that differ in 
size by less than about 0.2% (Sambrook et al., 
1989). High-resolution agarose gels such as those 
used in this work are even less sensitive, normally 
only discriminating products that differ in size by 
at least 1.5%. The use of the HA-red screening 
step enables resolution of identical or nearly iden- 
tical sequences based on their AT content (Wawer 
et al., 1995) and is sensitive down to < 1% differ- 
ence. Furthermore, it is rapid, technically simple 
and does not require the use of radiolabels. 
Geisinger et al. (1997) originally demonstrated the 
usefulness of using HA-red to identify different 
products cloned from the same band of an RNA 
differential display experiment by simultaneously 
running them in normal agarose (to discriminate 
by size) and in normal agarose containing HA-red 
(to discriminate by AT content). We have found 
that this approach is equally useful for identifying 
different gene species cloned from the same band 
of our SSH display. 
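
The size-resolution limits quoted above (about 0.2% for polyacrylamide and about 1.5% for high-resolution agarose) can be turned into a quick check of whether two cloned products would be expected to co-migrate in a single band. This is a minimal sketch; taking the percentage relative to the longer fragment is an assumption made here for illustration, and the fragment lengths are hypothetical.

```python
def min_resolvable_bp(fragment_bp, percent_limit):
    """Smallest size difference (bp) a gel can resolve for a fragment of this length."""
    return fragment_bp * percent_limit / 100.0

def co_migrate(len_a, len_b, percent_limit):
    """True if two fragments differ by less than the gel's resolution limit.

    The limit is applied relative to the longer fragment (an assumption).
    """
    longer = max(len_a, len_b)
    return abs(len_a - len_b) < min_resolvable_bp(longer, percent_limit)

# Hypothetical 500 bp and 505 bp products from one excised band.
print(co_migrate(500, 505, 1.5))  # high-resolution agarose limit
print(co_migrate(500, 505, 0.2))  # polyacrylamide limit
```

Products that co-migrate by size are the ones for which a second criterion, such as the AT-content separation provided by HA-red, is needed.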

Diatchenko et al. (1996) reported that SSH is 
highly efficient at producing differentially ex- 
pressed gene species. However, we also included a 
second screening step to further confirm that the 
clones isolated from the differential display were 
indeed differentially expressed. Duplicate dotblots 
of the candidate clones were probed with the
display from which they were originally isolated 
and with the 'reverse subtraction' display. To 
make the reverse-subtracted probe, the subtractive 
hybridisation step of the procedure was carried 
out using the original tester cDNA as a driver, 
and the original driver cDNA as a tester. In this 
way, clones that are false positives can be iden- 
tified through their presence in both blots. Such 
false positives most commonly arise through hav- 
ing a very high abundance in the initial sample or 
unusual hybridisation properties (Li et al., 1994). 
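
The decision rule used in this second screening step can be written down explicitly. The sketch below is illustrative only: the numeric background level and the ratio taken to mean a 'substantially greater' signal are assumed values, since the blots were scored visually and no thresholds are given in the text.

```python
def classify_clone(forward_signal, reverse_signal, background=1.0, min_ratio=3.0):
    """Classify a dot-blotted clone from its two hybridisation signals.

    forward_signal: signal with the subtracted probe the clone came from.
    reverse_signal: signal with the reverse-subtracted probe.
    background and min_ratio are hypothetical thresholds, not from the paper.
    """
    hybridises_forward = forward_signal > background
    hybridises_reverse = reverse_signal > background
    if hybridises_forward and not hybridises_reverse:
        return "true positive"
    if hybridises_forward and forward_signal >= min_ratio * reverse_signal:
        # Signal with both probes, but substantially greater with the subtracted probe.
        return "true positive"
    # Comparable signal with both probes, or no signal with the subtracted probe.
    return "false positive"

print(classify_clone(10.0, 0.5))
print(classify_clone(10.0, 8.0))
```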



Although the SSH method itself has been 
shown to be efficient, and despite the screening 
step that we included, there is an important caveat 
to bear in mind — namely that it is important 
that all clones be considered only as 'candidates' 
until the actual abundance of their mRNA is 
quantitated in treated and control samples. To- 
wards this end, we examined the expression of a 
limited number of clones using semi-quantitative 
RT-PCR. Albumin was used as the reference gene 
as we have previously found that the expression 
of this gene does not appear to change with the 
treatment regime that we used (Fig. 4, and data 
not shown). There are a number of interesting 
points to note from our results. The first is the 
presence of genes that serve as appropriate positive
controls in the upregulated and downregulated
series. For example, in the rat it can be seen
that CYP4A1 expression increases 14-fold following
treatment. Although CYP4A1 mRNA expression
levels following Wy-14,643 treatment have
not been previously reported in this model, the
figure compares favourably with that recorded by
Bell et al. (1991), who used RNAse-protection to
quantitate CYP4A1 in rat liver following treatment
with methylclofenapate, another PP. In addition,
we confirmed that the peroxisomal
enoyl-CoA:hydratase-3-hydroxyacyl-CoA bifunctional
enzyme is upregulated 9-fold, in agreement
with the findings of Chen and Crane (1992).
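
A fold change of the kind quoted above can be obtained by normalising each target band to the albumin band from the same sample and comparing treated with control animals. The exact calculation is not described in the text, so the following is a sketch under that assumption; the band intensities are hypothetical.

```python
from statistics import mean

def normalised(target, reference):
    """Divide each target band intensity by the reference (albumin) band of the same sample."""
    return [t / r for t, r in zip(target, reference)]

def fold_change(treated_target, treated_ref, control_target, control_ref):
    """Mean normalised expression in treated samples relative to controls."""
    treated = mean(normalised(treated_target, treated_ref))
    control = mean(normalised(control_target, control_ref))
    return treated / control

# Hypothetical densitometry values (arbitrary units), not data from the paper.
fc = fold_change([50.0, 50.0], [100.0, 100.0], [100.0, 100.0], [100.0, 100.0])
print(fc)  # a value below 1 indicates downregulation
```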

A number of genes were downregulated follow- 
ing Wy-14,643 exposure, including CYP2C11 ex- 
pression. Corton et al. (1997) reported similar 
findings and suggested that this may in part ex- 
plain why male rats exposed to Wy-14,643 and 
some other PPs have high serum estradiol levels, 
as estradiol is a substrate for CYP2C11. We have
also shown that the expression of contrapsin-like 
protease inhibitor (CLPI) was downregulated by 
Wy-14,643. This has not previously been reported, 
and we suggest that it may be linked to a require- 
ment for increased availability of amino acids to 
accommodate the hepatomegaly induced by treatment.
Although little is known of the function of
parathymosin-α (zinc2+-binding protein), it has
been shown to interact with the globular domain
of histone H1, suggesting a role in histone function
(Kondili et al., 1996). In contrast to the
[Gel image: RT-PCR bands labelled Albumin and Transferrin]


Fig. 4. Semi-quantitative RT-PCR experiment showing relative decrease in expression of transferrin in treated rat liver (RT-1 to 
RT-4) compared to controls (RC-1 to RC-3). An equal amount of mRNA was used in each reaction (10 ng), and each sample was 
quantitated in triplicate to reduce the effects of inter-tube variation. N is negative control (no mRNA). Lane M is a 100 bp ladder 
and lane L is a 1 Kb DNA ladder. 



downregulation observed in this work, other studies
have shown that parathymosin-α expression is
elevated in breast cancer (Tsitsilonis et al., 1993,
1998), with the implication that parathymosin-α
may somehow be involved in regulating cell proliferation
by more than one mechanism. Transferrin
has previously been shown to be
downregulated in rat liver by hypolipidemic PPs 
(Hertz et al., 1996). It is therefore interesting to 
note that we isolated a clone identified as transfer- 
rin from the upregulated display profile. Since we 
confirmed by RT-PCR that transferrin is in fact 
downregulated in the rat (Fig. 4), we conclude 
that transferrin was either a false positive or was 
incorrectly identified. It could also be that we 
have isolated a close relative, splice variant or 
isoform of transferrin, which demonstrates a dif- 
ferent expression profile under these experimental 
conditions. Further investigations are therefore 



required to determine which of these possibilities
is correct.

One of our most intriguing observations was 
that one gene, CD81, appeared to be upregulated 
in rat liver but downregulated in guinea pig liver 
following Wy- 14,643 exposure. CD81 is a widely 
expressed cell surface protein that is involved in a 
large number of cellular functions, including ad- 
hesion, activation, proliferation and differentia- 
tion (reviewed by Levy et al., 1998). Since all of 
these functions are altered to some extent in car- 
cinogenesis, it is perhaps an important observa- 
tion that CD81 expression is differentially 
regulated in a resistant and sensitive species ex- 
posed to a non-genotoxic carcinogen. 

Albumin and ribosomal genes appear common
to all differential displays and are thus undesirable
false positives. However, due to their high
expression in the liver, they are difficult to remove.
We also noted a number of gene species,
particularly in the guinea pig, which were com- 
mon to both upregulated and downregulated 
profiles. Again, the most likely reason for these 
having arisen is their high abundance. 

A relatively large number of upregulated and 
downregulated genes were isolated from guinea 
pig liver following Wy-14,643 exposure. However, 
the guinea pig genome has been relatively poorly 
characterised and so many of the clones were 
identified as resembling genes or ESTs from other 
species. Without full-length sequence data it is 
difficult to ascertain the accuracy of the assigned 
identities and this must be borne in mind when 
utilising data such as this, for example, in design- 
ing effective primers for RT-PCR studies. Al- 
though the actual isolated clone sequences can be 
used to do this, their relatively small size often 
restricts the ability to design effective primers. In 
addition, as we observed with transferrin, using a 
published full-length sequence may help to iden- 
tify false positives. 



By comparing the expression profiles of genes 
showing altered expression in a PP-sensitive spe- 
cies (rat) with a PP-resistant species (guinea pig), 
it was our aim to identify genes that are mecha- 
nistically relevant to the non-genotoxic hepatocar- 
cinogenic action of Wy-14,643. However, few of 
the genes that we have isolated were common to 
both the rat and the guinea pig. This suggests 
either that the molecular mechanisms of response 
in these two species are so different that few genes 
are commonly regulated in response to Wy-14,643 
exposure, or that we have recovered only a small 
proportion of those genes that have altered ex- 
pression. The latter seems the more likely scenario 
since it is perceived that one of the main problems 
of subtractive hybridisation and other differential 
expression technologies is the inability to consis- 
tently isolate rare gene transcripts (Bertioli et al., 
1995). This is potentially problematic in that 
weakly expressed genes may play an important 
role in regulating key cellular processes, and that 
the majority of mRNA species are classified as 



Table 6

Semi-quantitative RT-PCR analysis of selected gene species in the rat and guinea pig a

                                          Putative change of expression       Change according to RT-PCR
                                          following treatment according       quantitation
                                          to dotblot
Transcript                                Rat         Guinea pig              Rat                         Guinea pig

Albumin                                   N/A         N/A                     No change                   No change
Bifunctional enzyme                       Up          N/A                     Upregulated* (9×)           N/O
CYP2C11                                   Down        N/A                     Downregulated* (Abolished)  N/D
CYP4A1                                    Up          N/A                     Upregulated* (14×)          N/D
Catalase                                  N/A         Up                      No change                   N/O
CD81 (TAPA-1)                             Up          Down                    N/O                         Upregulated** (1.4×)
Contrapsin-like protease inhibitor        Down        N/A                     Downregulated** (0.5×)      N/D
Parathymosin-α (zinc2+ binding protein)   Down        N/A                     Downregulated** (0.6×)      N/D
Transferrin                               Up          N/A                     Downregulated* (0.5×)       No change
UDP-Glucuronosyl transferase              Down        N/A                     Downregulated** (0.2×)      N/O
DownUnknown-1                             Down        N/A                     No change (P = 0.06)        N/D
Zn-α2-glycoprotein                        Up          N/A                     No change                   N/O

a N/A, not applicable; N/O, not optimised; N/D, not done.
* P < 0.0005;
** P < 0.05.
'rare' in abundance (Bertioli et al., 1995). However,
in their original paper describing the SSH
technique, Gurskaya et al. (1996) demonstrated 
that SSH can enrich rare molecules between 1000- 
and 5000-fold in a single round of hybridisation. 
Unfortunately, due to high background smearing 
in our initial experiments (which hindered identifi- 
cation of single bands), we were compelled to 
reduce the primary hybridisation time to only 4 h,
a step that theoretically is likely to reduce the
number of rare sequences (CLONTECHniques, 
1996). Furthermore, it has been claimed by the 
manufacturers that, whilst this technique can 
identify changes as small as 1.5-fold between the 
driver and tester populations, it is best suited to 
the isolation of genes that show a greater than 
5-fold increase (CLONTECHniques, 1996). In ad- 
dition, where tester and driver contain genes with 
large and small differences in abundance, the SSH 
method will be biased towards identifying those 
genes with the large differences (CLONTECH- 
niques, 1996). Thus, it is most probable that we 
have not isolated all of the more rarely expressed 
transcripts and those demonstrating small changes
in expression.

One problem that remains is identifying the 
function of genes isolated in SSH experiments as 
described herein, some of which may be crucial to 
the process of carcinogenesis, and are, to date, 
unidentified. However, we have provided evidence 
herein that SSH can be used to begin the process 
of characterising the extent and importance of 
altered gene expression in response to a chemical 
stimulus. The developments of this approach 
should include characterisation of temporal and 
dose responses, and functional analysis studies 
including knockout mice. In combination, such 
studies should make a significant contribution to 
our understanding of the molecular mechanisms 
of action and physiological relevance of gene regulation
in non-genotoxic hepatocarcinogenesis. It
should then be possible to ascertain whether differentially
expressed genes are causally or casually
related to the chemical-induced toxicity, which
would be a substantial mechanistic advance.

It is clear that there are also broader applica- 
tions for this experimental approach that go be- 
yond understanding the molecular mechanisms of 



peroxisome-proliferator induced non-genotoxic 
hepatocarcinogenesis in rodents. The potential 
medical and therapeutic benefits of elucidating the 
molecular changes that occur in any given cell in 
progressing from the normal to the carcinogenic 
(or other diseased, abnormal or developmental) 
state are very substantial. Notwithstanding the 
lack of complete functional identification of altered
gene expression, gene profiling studies such as those
described herein essentially provide a 'fingerprint'
of each stage of carcinogenesis, and should help in 
the elucidation of specific and sensitive biomark- 
ers for different types of cancer. Amongst other 
benefits, such fingerprints and biomarkers could 
help uncover differences in histologically identical 
cancers, and provide diagnostic tests for the earli- 
est stages of neoplasia. In addition, the genes 
identified by this approach may be incorporated 
into gene-chip DNA-arrays, thus providing a 
standard genetic fingerprint for a particular toxin 
treatment in a particular species. Interrogation of 
these gene arrays for an unknown compound that 
has a similar pattern to the known reference 
chemical would then provide evidence that the 
unknown may have a toxicity profile similar to 
the 'standard' fingerprint, thereby serving as a 
mechanistically relevant platform for further de- 
tailed investigations. 
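
One way such a comparison of an unknown compound against a reference fingerprint might be scored is by correlating the log-transformed expression changes across the arrayed genes. This is a sketch of that idea only, not a method described in this work: the choice of Pearson correlation on log2 fold changes is an assumption, the reference values are the rat RT-PCR fold changes from Table 6, and the unknown profile is hypothetical.

```python
import math

def log_ratios(fold_changes):
    """Convert fold changes to log2 ratios so up- and downregulation are symmetric."""
    return [math.log2(f) for f in fold_changes]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def similarity(reference, unknown):
    """Correlation between two expression fingerprints given as fold changes."""
    return pearson(log_ratios(reference), log_ratios(unknown))

# Rat fold changes from Table 6: CYP4A1, bifunctional enzyme, transferrin,
# parathymosin-alpha, UDP-glucuronosyl transferase.
reference = [14.0, 9.0, 0.5, 0.6, 0.2]
# Hypothetical profile for an uncharacterised compound on the same genes.
unknown = [10.0, 6.0, 0.7, 0.5, 0.3]
print(round(similarity(reference, unknown), 3))
```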
ABSTRACT We have developed high-density DNA microarrays
of yeast ORFs. These microarrays can monitor
hybridization to ORFs for applications such as quantitative
differential gene expression analysis and screening for sequence
polymorphisms. Automated scripts retrieved sequence
information from public databases to locate predicted ORFs
and select appropriate primers for amplification. The primers
were used to amplify yeast ORFs in 96-well plates, and the
resulting products were arrayed using an automated microarraying
device. Arrays containing up to 2,479 yeast ORFs
were printed on a single slide. The hybridization of fluorescently
labeled samples to the array was detected and quantitated
with a laser confocal scanning microscope. Applications
of the microarrays are shown for genetic and gene
expression analysis at the whole genome level.

The genome sequencing projects have generated and will continue
to generate enormous amounts of sequence data. The
genomes of Saccharomyces cerevisiae, Haemophilus influenzae (1),
Mycoplasma genitalium (2), and Methanococcus jannaschii (3)
have been completely sequenced. Other model organisms have
had substantial portions of their genomes sequenced as well
including the nematode Caenorhabditis elegans (4) and the small
flowering plant Arabidopsis thaliana (5). Given this ever-increasing
amount of sequence information, new strategies are
necessary to efficiently pursue the next phase of the genome
projects: the elucidation of gene expression patterns and gene
product function on a whole genome scale.

One important use of genome sequence data is to attempt 
to identify the functions of predicted ORFs within the genome. 
Many of the ORFs identified in the yeast genome sequence 
were not identified in decades of genetic studies and have no 
significant homology to previously identified sequences in the 
database. In addition, even in cases where ORFs have signif- 
icant homology to sequences in the database, or have known 
sequence motifs (e.g., protein kinase), this is not sufficient to 
determine the actual biological role of the gene product. 
Experimental analysis must be performed to thoroughly understand
the biological function of a given ORF's product.
Model organisms, such as S. cerevisiae, will be extremely
important in improving our understanding of other more
complex and less manipulable organisms.

To examine in detail the functional role of individual ORFs and 
relationships between genes at the expression level, this work 
describes the use of genome sequence information to study large 
numbers of genes efficiently and systematically. The procedure 
was as follows. (i) Software scripts scanned annotated sequence
information from public databases for predicted ORFs. (ii) The
start and stop position of each identified ORF was extracted
automatically, along with the sequence data of the ORF and 200
bases flanking either side. (iii) These data were used to automatically
select PCR primers that would amplify the ORF. (iv) The
primer sequences were automatically input into the automated
multiplex oligonucleotide synthesizer (6). (v) The oligonucleotides
were synthesized in 96-well format, and (vi) used in 96-well
format to amplify the desired ORFs from a genomic DNA
template. (vii) The products were arrayed using a high-density
DNA arrayer (7-10). The gene arrays can be used for hybridization
with a variety of labeled products such as cDNA for gene
expression analysis or genomic DNA for strain comparisons, and
genomic mismatch scanning purified DNA for genotyping (11).
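
The extraction of an ORF together with 200 flanking bases on either side can be sketched as follows. This is an illustration in Python rather than the original scripts; it assumes 1-based inclusive coordinates on the forward strand and clips the flanks at the chromosome ends, neither of which is stated in the text.

```python
def extract_with_flanks(chromosome, start, stop, flank=200):
    """Return the ORF plus `flank` bases on each side.

    start and stop are assumed to be 1-based, inclusive, forward-strand
    coordinates; flanks are truncated at the ends of the chromosome.
    """
    left = max(0, start - 1 - flank)
    right = min(len(chromosome), stop + flank)
    return chromosome[left:right]

# Toy chromosome: 300 bases, a 100-base ORF (ATG ... TAA), then 300 more bases.
chromosome = "A" * 300 + "ATG" + "C" * 94 + "TAA" + "G" * 300
region = extract_with_flanks(chromosome, 301, 400)
print(len(region), region[200:203], region[297:300])
```

The flanking sequence gives the primer selection step room to place primers outside the coding region so that the entire ORF is amplified.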

METHODS 

Script Design. All scripts were written in UNDCTool Command 
Language. Annotated sequence information from GenBank was 
extracted into one file containing the complete nucleotide se- 
quence of a single chromosome. A second fue contained the 
assigned ORF name followed by the start and stop positions of that 
ORF. The actual sequence contained within the specified range, 
along with 200 bases of sequence flanking both sides, was extracted 
and input into the primer selection program primer as (White- 
head Institute, Boston). Primers were designed so as to allow 
amplification of entire ORFs. The selected primer sequences were 
read by the 96-well automated multiplex oligonucleotide synthe- 
sizer instrument for primer synthesis. The forward and reverse 
primers were synthesized in two separate 96-well plates in corre- 
sponding wells. All primers were synthesized on a 20-nmol scale. 
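
The extraction step described above can be sketched in a few lines. The original scripts were written in Tool Command Language and are not reproduced in the text, so the following Python is only an illustration under stated assumptions: the ORF table is taken to be whitespace-separated `NAME START STOP` lines, coordinates are taken to be 1-based and inclusive, and the function names are hypothetical.

```python
FLANK = 200  # bases of flanking sequence on each side, as stated in the text


def parse_orf_table(text):
    """Parse 'NAME START STOP' lines (assumed layout) into (name, start, stop) tuples."""
    orfs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        orfs.append((fields[0], int(fields[1]), int(fields[2])))
    return orfs


def extract_with_flanks(chromosome, start, stop, flank=FLANK):
    """Return the ORF plus flanking bases (1-based, inclusive coordinates).

    Coordinates may be given in either order, and the window is clamped
    at the chromosome ends.
    """
    low, high = min(start, stop), max(start, stop)
    window_start = max(1, low - flank)
    window_end = min(len(chromosome), high + flank)
    return chromosome[window_start - 1:window_end]


# Toy sequence for illustration only, not real yeast data.
toy_chromosome = "A" * 300 + "ATGAAATAG" + "C" * 300
for name, start, stop in parse_orf_table("YXX001 301 309\n"):
    region = extract_with_flanks(toy_chromosome, start, stop)
    print(name, len(region))
```

The extracted window, rather than the bare ORF, is what a primer selection program would receive, so that primers can be placed in the flanking sequence and the entire ORF amplified.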

ORF Amplification and Purification. Genomic DNA was iso- 
lated as described (12) and used as template for the amplification 
reactions. Each PCR was done in a total volume of 100 µl. A total
of 0.2 µM each of forward and reverse primers were aliquoted into
a 96-well PCR plate (Robbins Scientific, Sunnyvale, CA); a master
mix containing 0.24 mM each dNTP, 10 mM Tris (pH 8.5), 50 mM
MgCl2, 25 units Taq polymerase, and 10 ng of template was added
to the primers, and the entire mix was thermal cycled for 30 cycles
as follows: 15 min at 94°C, 15 min at 54°C, and 30 min at 72°C.
Products were ethanol precipitated in polystyrene v-bottom 96- 
well plates (Costar). All samples were dried and stored at -20°C. 

Arraying Procedure and Processing. Microarrays were 
made as described (8). 

A custom built arraying robot was used to print batches of 48 
slides. The robot utilizes four printing tips which simultaneously 
pick up ~1 µl of solution from 96-well microtiter plates. After
printing, the microarrays were rehydrated for 30 sec in a humid
chamber and then snap dried for 2 sec on a hot plate (100°C). The
DNA was then UV crosslinked to the surface by subjecting the
slides to 60 millijoules of energy. The rest of the poly-L-lysine
surface was blocked by a 15-min incubation in a solution of 70 mM
succinic anhydride dissolved in a solution consisting of 315 ml of
1-methyl-2-pyrrolidinone (Aldrich) and 35 ml of 1 M boric acid
(pH 8.0). Directly after the blocking reaction, the bound DNA
was denatured by a 2-min incubation in distilled water at ~95°C.

Abbreviation: YEP, yeast extract/peptone. 

†To whom reprint requests should be sent at the present address:
Synteni, Inc., 6519 Dumbarton Circle, Fremont, CA 94555. 
Fig. 1. Two-color fluorescent scan of a yeast microarray containing
2,479 elements (ORFs). The center-to-center distance between
elements is 345 µm. A probe mixture consisting of cDNA from yeast
extract/peptone (YEP) galactose (green pseudocolor) and YEP glu- 
cose (red pseudocolor) grown yeast cultures was hybridized to the 
array. Intensity per element corresponds to ORF expression, and 
pseudocolor per element corresponds to relative ORF expression 
between the two cultures. 

The slides were then transferred into a bath of 100% ethanol at 
room temperature. 

Probe Preparation: cDNA. Yeast cultures (100 ml) were grown
to ~1 OD A600 and total RNA was isolated as described (13). Up
to 500 µg total RNA was used to isolate mRNA (Qiagen,
Chatsworth, CA). Oligo(dT)20 (5 µg) was added and annealed to
2 µg of mRNA by heating the reaction to 70°C for 10 min and
quick chilling on ice, plus 2 µl SuperScript II (200 units/µl) (Life
Technologies, Gaithersburg, MD), 0.6 µl 50× dNTP mix (final
concentrations were 500 µM dATP, dCTP, dGTP, and 200 µM
dTTP), 6 µl 5× reaction buffer, and 60 µM Cy3-dUTP or
Cy5-dUTP (Amersham). Reactions were carried out at 42°C for
2 h, after which the mRNA was degraded by the addition of 0.3
µl 5 M NaOH and 0.3 µl 100 mM EDTA and heating to 65°C for
10 min. The sample was then diluted to 500 µl with TE and
concentrated using a Microcon-30 (Amicon) to 10 µl.

Probe Preparation: Genomic DNA. Fluorescent DNA was 
prepared from total genomic DNA as follows: 1 µg of random
nonamer oligonucleotides was added to 2.5 µg of genomic
DNA. This mixture was boiled for 2 min and then chilled on
ice. A reaction mixture containing dNTPs (25 µM dATP,
dCTP, dGTP, 10 mM dTTP, and 40 µM Cy3-dUTP or
Cy5-dUTP), reaction buffer (New England Biolabs), and 20
units exonuclease free Klenow enzyme (United States
Biochemical) was added, and the reaction was incubated at 37°C
for 2 h. The sample was then diluted to 500 µl with TE and
concentrated using a Microcon-30 (Amicon) to 10 µl.

Hybridization. Purified, labeled probe was resuspended in 11
µl of 3.5× SSC containing 10 µg Escherichia coli tRNA, and 0.3%
SDS. The sample was then heated for 2 min in boiling water,
cooled rapidly to room temperature, and applied to the array. The
array was placed in a sealed, humidified, hybridization chamber.
Hybridization was carried out for 10 h in a 62°C water bath, after
which the arrays were washed immediately in 2× SSC/0.2% SDS.
A second wash was performed in 0.1× SSC.

Analysis and Quantitation. Arrays were scanned on a
scanning laser fluorescence microscope developed by Steve
Smith with software written by Noam Ziv (Stanford
University). A separate scan was done for each of the two
fluorophores used. The images were then combined for analysis. A
bounding box, fitted to the size of the DNA spots, was placed 
over each array element. The average fluorescent intensity was 
calculated by summing the intensities of each pixel present in 
a bounding box and then dividing by the total number of pixels. 
Local area background was calculated for each array element 
by determining the average fluorescent intensity at the edge of 
the bounding box. To normalize for fluorophore-specific
variation, control spots containing yeast genomic DNA were
applied to each quadrant during the arraying process. These
elements were quantitated and the ratios of the signals were
determined. These ratios were then used to normalize the
photomultiplier sensitivity settings such that the ratios of the
fluorescence of the genomic DNA spots were close to a value 
of 1.0. The average signal intensity at any given spot was 
regarded as significant if it was at least two standard deviations 
above background. Each experiment was conducted in dupli- 
cate, with the fluorophores representing each channel re- 
versed. The ratios presented here are the average of the two 
experiments, except in the case in which the signal for the 
element in question was below the reliability threshold. The 
reliability threshold also determined the dynamic range of the 
experiment. For all of the experiments presented, the average 
dynamic range was ~1 to 100. In the case where the
fluorescence from a very bright spot saturates the detector,
differential ratios will, in general, be underestimated. This can be
compensated for by scanning at a lower overall sensitivity. 
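
The quantitation rules in this section (mean pixel intensity inside a bounding box, local background from the box edge, a two-standard-deviation significance cutoff, and averaging of dye-swapped duplicates) can be illustrated with a small standard-library sketch. Several details are assumptions rather than statements from the text: the "edge" is taken to be the one-pixel border of the box, the standard deviation is taken over those border pixels, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def quantify_spot(image, top, left, height, width):
    """Return (signal, background, background_sd) for one array element.

    `image` is a list of rows of pixel intensities. The box must be at
    least 2 pixels on each side. Background is estimated from the
    one-pixel border of the box (an assumption, see lead-in).
    """
    box = [row[left:left + width] for row in image[top:top + height]]
    pixels = [p for row in box for p in row]
    signal = sum(pixels) / len(pixels)  # summed intensity / number of pixels
    edge = (box[0] + box[-1]
            + [row[0] for row in box[1:-1]]
            + [row[-1] for row in box[1:-1]])
    return signal, mean(edge), pstdev(edge)


def is_significant(signal, background, background_sd, n_sd=2.0):
    """True if the signal is at least n_sd standard deviations above background."""
    return signal >= background + n_sd * background_sd


def dye_swap_ratio(ratio_a, ratio_b):
    """Average two replicate ratios that are already in the same orientation."""
    return (ratio_a + ratio_b) / 2.0


# Toy 5x5 image: a bright 3x3 spot surrounded by a dim border.
toy_image = [
    [10, 12, 10, 12, 10],
    [12, 100, 100, 100, 12],
    [10, 100, 100, 100, 10],
    [12, 100, 100, 100, 12],
    [10, 12, 10, 12, 10],
]
signal, background, sd = quantify_spot(toy_image, 0, 0, 5, 5)
print(signal, background, sd, is_significant(signal, background, sd))
```

An element that fails `is_significant` in one replicate would, following the text, be excluded from the averaged ratio rather than averaged in.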

RESULTS 

The accumulation of sequence information from model organ- 
isms presents an enormous opportunity and challenge to under- 
stand the biological function of many previously uncharacterized 
genes. To do this accurately and efficiently, a directed strategy 
was developed that enables the monitoring of multiple genes 
simultaneously. Microarraying technology provides a method by 
which DNA can be attached to a glass surface in a high-density 
format (8). In practice, it is possible to array over 6,000 elements 
in an area less than 1.8 cm². Given that the yeast genome consists
of ~6,100 ORFs, the entire set of yeast genes can be spotted onto
a single glass slide. 
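
As a rough check on these density figures, the spacing needed to fit a given number of elements into a given area can be estimated. The sketch below assumes a simple square grid with no margins, which the text does not specify; the function names are illustrative only.

```python
import math

UM2_PER_CM2 = 1e8  # 1 cm^2 = 10^8 square micrometres


def max_pitch_um(n_elements, area_cm2):
    """Largest center-to-center spacing (µm) that fits n_elements on a square grid."""
    return math.sqrt(area_cm2 * UM2_PER_CM2 / n_elements)


def elements_at_pitch(pitch_um, area_cm2):
    """Number of elements that fit in area_cm2 at the given spacing (square grid)."""
    return int(area_cm2 * UM2_PER_CM2 // (pitch_um ** 2))


print(round(max_pitch_um(6000, 1.8), 1))  # spacing needed for 6,000 elements in 1.8 cm^2
print(elements_at_pitch(345, 1.8))        # elements in 1.8 cm^2 at the 345 µm spacing of Fig. 1
```

Under these assumptions, 6,000 elements in 1.8 cm² implies a spacing of roughly 173 µm, about half the 345 µm spacing used for the arrays shown in Fig. 1.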

With this capability and the availability of the entire se- 
quence of the yeast genome, our strategy was to use a directed 
approach for generating the complete genome array. This 
procedure involved synthesizing a pair of oligonucleotide 
primers to amplify each ORF. The PCR product containing 
each gene of interest was arrayed onto glass and used, for 
example, as probe for monitoring gene expression levels by 
hybridizing to the array labeled cDNA generated from isolated 
mRNA of a culture grown under any experimental condition. 

Primer Selection and Synthesis. The primer selection was fully
automated using Tool Command Language scripts and PRIMER
0.5 (Whitehead). Primer pairs were selected automatically and
successfully for >99% of the ORFs tested. Primer sequences can thus
be selected rapidly with minimal manual processing. A complete
set of forward and reverse primers was selected initially for each
ORF on chromosomes I, II, III, V, VI, VIII, IX, X, and XI.
Primers for a representative set of ORFs (15% coverage) were 
chosen for the remaining chromosomes. With the release of the 
entire yeast genome sequence, the complete set of primers has 
now been selected. 

Table 1. Heat shock vs. control expression data

Ratio of gene expression
Control  Heat   ORF      Gene     Description
         2.2    YLR142   PUT1     Proline oxidase
         2.0    YOL140   ARG8     Acetylornithine aminotransferase
2.3             YGL148   ARO2     Chorismate synthase
         36.0   YFL014   HSP12    Heat shock protein
         27.4   YBR072   HSP26    Heat shock protein
         6.7    YBR054   YRO2     Similarity to HSP30 heat shock protein Yro1p
         3.4    YCR021   HSP30    Heat shock protein
         2.6    YER103   SSA4     Heat shock protein
         2.5    YLR259   HSP60    Mitochondrial heat shock protein HSP60
         2.1    YBR169   SSE2     Heat shock protein of the HSP70 family
         1.7    YBL075   SSA3     Cytoplasmic heat shock protein
         1.4    YPL240   HSP82    Heat shock protein
         1.4    YDR258   HSP78    Mitochondrial heat shock protein of clpb family
1.0             YNL007   SIS1     Heat shock protein
1.1             YEL030            70-kDa heat shock protein
1.9             YHR064            Heat shock protein
         1.3    YBL008   HIR1     Histone transcription regulator
2.6             YBL002   HTB2     Histone H2B.2
3.3             YBL003   HTA2     Histone H2A.2
3.3             YBR010   HHT1     Histone H3
3.9             YBR009   HHF1     Histone H4
         2.4    YDR343   HXT6     High-affinity hexose transporter
         2.1    YHR092   HXT4     Moderate- to low-affinity glucose transporter
3.6             YAR071   PHO11    Secreted acid phosphatase, 56 kDa isozyme
         2.3    YLR096   KIN2     Ser/Thr protein kinase
2.5             YER102   RPS8B    Ribosomal protein S8.e
2.6             YBR181   RPS101   Ribosomal protein S6.e
2.6             YCR031   CRY1     40S ribosomal protein S14.e
2.7             YLR441   RP10A    Ribosomal protein S3a.e
2.8             YHR141   RPL41B   Ribosomal protein L36a.e
2.8             YBL072   RPS8A    Ribosomal protein S8.e
2.8             YHL015   URP2     Ribosomal protein
2.8             YBR191   URP1     Ribosomal protein L21.e
3.1             YLR340   RPLA0    Acidic ribosomal protein L10.e
3.3             YGL123   SUP44    Ribosomal protein
         5.8    YLR194            Hypothetical protein

Because each ORF requires a unique pair of synthetic primers,
a total of approximately 12,200 oligonucleotides will be required
to individually amplify each target. This costly component was
addressed with the automated multiplex oligonucleotide
synthesizer (6), which efficiently synthesizes primers in a 96-well format.
Each primer, synthesized on a 20-nmol scale, provides enough
material for 100 amplification reactions, whereas a given PCR
product provides enough material to generate an element on
500-1,000 arrays. Thus, a single primer pair provides enough
starting material for up to ~50,000 arrays.
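
The scale-up arithmetic in this paragraph can be reproduced directly. The sketch below uses only figures quoted in the text (about 6,100 ORFs, 96-well plates, forward and reverse primers in separate plates, 100 reactions per primer, 500-1,000 arrays per PCR product); the plate count additionally assumes that every plate is filled completely, which the text does not state, and the function names are hypothetical.

```python
import math

WELLS_PER_PLATE = 96


def oligo_requirements(n_orfs):
    """Return (oligonucleotides, synthesis plates) for one primer pair per ORF.

    Forward and reverse primers go into separate plates in corresponding
    wells, so the plate count is doubled.
    """
    oligos = 2 * n_orfs
    plates_per_direction = math.ceil(n_orfs / WELLS_PER_PLATE)
    return oligos, 2 * plates_per_direction


def arrays_per_primer_pair(reactions_per_primer, arrays_per_product):
    """Arrays that one primer pair can ultimately supply."""
    return reactions_per_primer * arrays_per_product


oligos, plates = oligo_requirements(6100)
print(oligos, plates)
print(arrays_per_primer_pair(100, 500), arrays_per_primer_pair(100, 1000))
```

With 100 reactions per primer, the ~50,000 figure quoted in the text corresponds to 500 arrays per PCR product, the lower end of the stated 500-1,000 range.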

Primers were synthesized to amplify yeast ORFs. Primer 
synthesis had a failure rate of <1% in over 18 plates of 
synthesis as determined by standard trityl analysis (6). The 
success rate of the PCR amplifications using the primer pairs 
was 94% based on agarose gel analysis of each PCR. The 
purified PCR products were used to generate arrays. Two 
versions of the arrays were created for the experimental results 
presented here. The first array contained 2,287 elements and 
the second array batch contained 2,479 elements. 

Genome Arrays. The amplified ORFs were arrayed onto glass 
at a spacing of 345 microns (Fig. 1). The high-density spacing of 
DNA samples allows the hybridization volumes to be
minimized; volumes are a maximum of 10 µl. The labeled probe can
thus be maintained at relatively high concentrations, making 1-2
µg of mRNA sufficient for analysis. This also obviates the need
for a subsequent amplification step and thus avoids the risk of 
altering the relative ratios of different cDNA species in the 
sample. 

Genetic Analysis: Genomic Comparison of Unrelated Strains. 
Microarrays allow efficient comparison of the genomes of
different strains. Genomic DNA from Y55, an S. cerevisiae strain
divergent from the reference strain S288c, was randomly labeled
with Cy3-dUTP and hybridized simultaneously with the S288c
DNA labeled with Cy5-dUTP. When a comparison between the
hybridization of the DNA from the two strains was done, several
elements gave relatively little or no signal above background from
the Cy3 channel (data not shown). These include SGE1,
ASP3A-D, YLR156, YLR159, YLR161, ENA2 (YDR039 is
ENA2), and YCR105. These results imply that the regions
containing these genes are extremely divergent, or altogether
deleted from the strain. Subsequent attempts to generate PCR
products from SGE1, ENA2, and ASP3A using Y55 DNA failed.
This result supports the conclusion that these genes are likely to 
be missing from the Y55 genome. It is interesting to note that at 
least two of the regions absent in the Y55 genome have been 
previously shown or suggested to be deleted in mutant laboratory 
strains (14-16). In particular, the Asp-3 region appears to be 
highly prone to being deleted (15, 16). 

These results indicate that gene arrays can be used to efficiently 
screen different strains of an organism for large deletion poly- 
morphisms. A single hybridization and scan will reveal differences 
based on differential hybridization to particular elements. It is 
reasonable to suppose that an equivalent number of genes are 
present in the Y55 genome and absent in the S288c genome. This 
result should be viewed as a minimum estimate of the deletion
polymorphisms that exist between these two unrelated strains, as
intergenic deletions or small intragenic deletions would not be
detected because considerable hybridizing material would
remain. Sequence polymorphisms, such as deletions, are present
in populations of every species and must at some level affect 
phenotype. One of the challenges of the genome era will be to 
critically examine sequence polymorphisms that exist in the 
natural gene pool relative to the reference genome sequence. 
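
The screening logic described in this section, in which elements hybridize with the reference-strain DNA but give little or no signal from the test strain, can be sketched as a simple filter. This is an illustration only: the two-standard-deviation cutoff is borrowed from the Analysis and Quantitation section, a single background estimate per element is assumed for both channels, and the element names and intensities are invented placeholders rather than data from the study.

```python
def candidate_deletions(spots, n_sd=2.0):
    """Return elements with a significant reference signal but no significant test signal.

    `spots` maps an element name to a tuple
    (test_signal, reference_signal, background, background_sd).
    """
    flagged = []
    for name, (test, reference, background, sd) in spots.items():
        threshold = background + n_sd * sd
        if reference >= threshold and test < threshold:
            flagged.append(name)
    return sorted(flagged)


# Placeholder values for illustration; not measurements from the paper.
toy_spots = {
    "orf_A": (950.0, 1000.0, 100.0, 20.0),  # present in both strains
    "orf_B": (110.0, 1200.0, 100.0, 20.0),  # reference only: candidate deletion
    "orf_C": (105.0, 120.0, 100.0, 20.0),   # failed element: no usable signal
}
print(candidate_deletions(toy_spots))
```

As the text notes, candidates from such a screen would still need independent confirmation, for example by attempting PCR from the test strain's DNA.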
Fig. 2. ORF categories displaying differential expression between
heat-shocked and untreated cultures. Bars within categories
correspond to individual ORFs. Green shaded bars correspond to
relative increases in ORF expression under 25°C growth conditions.
Red shaded bars correspond to relative increases in ORF expression
under 39°C growth conditions.



Gene Expression Analysis. The arrays were used to examine
gene expression in yeast grown under a variety of different
conditions. Expression analysis is an ideal application of these
arrays because a single hybridization provides quantitative
expression data for thousands of genes. To better understand results for
genes of known function, ORFs were placed in biologically
relevant categories on the basis of function (e.g., amino acid catabolic
genes) and/or pathways (e.g., the histidine biosynthesis pathway).

Table 2. Cold shock vs. control expression data

Ratio of gene expression
Control  Cold   ORF      Gene     Description
         3.3    YOR153   PDR5     Pleiotropic drug resistance protein
2.4             YCR012   PGK1     Phosphoglycerate kinase
2.9             YCL040   GLK1     Aldohexose specific glucokinase
         1.4    YHR064            Heat shock protein
2.0             YJL034   KAR2     Nuclear fusion protein
2.1             YDR258   HSP78    Mitochondrial heat shock protein of clpb family of ATP-dependent proteases
2.2             YLL039   UBI4     Ubiquitin precursor
2.7             YLL026   HSP104   Heat shock protein
3.1             YER103   SSA4     Heat shock protein
3.3             YBR126   TPS1     α,α-Trehalose-phosphate synthase (UDP-forming)
3.8             YPL240   HSP82    Heat shock protein
7.9             YBR054   YRO2     Similarity to HSP30 heat shock protein Yro1p
7.9             YBR072   HSP26    Heat shock protein
16.5            YCR021   HSP30    Heat shock protein
1.8             YDR343   HXT6     High-affinity hexose transporter
2.1             YHR096   HXT5     Putative hexose transporter
2.4             YFR053   HXK1     Hexokinase I
2.8             YHR092   HXT4     Moderate- to low-affinity glucose transporter
3.4             YHR094   HXT1     Low-affinity hexose (glucose) transporter
         2.3    YHR089   GAR1     Nucleolar rRNA processing protein
         1.7    YLR048   NAB1B    40S ribosomal protein p40 homolog b
         1.7    YLR441   RP10A    Ribosomal protein S3a.e
         1.7    YLL045   RPL4B    Ribosomal protein L7a.e.B
         1.6    YLR029   RPL13A   Ribosomal protein L15.e
         1.6    YGL123   SUP44    Ribosomal protein
         3.1    YBR067   TIP1     Cold- and heat-shock-induced protein of the Srp1p/Tip1p family
         2.2    YER011   TIR1     Cold-shock-induced protein of the Tir1p, Tip1p family
         2.0    YCR058            Hypothetical protein
         4.2    YKL102            Hypothetical protein
Table 3. Glucose vs. galactose expression data

Ratio of gene expression
Glucose  Galactose  ORF      Gene      Description
2.1                 YHR018   ARG4      Arginosuccinate lyase
3.5                 YPR035   GLN1      Glutamate-ammonia ligase
2.8                 YML116   ATR1      Aminotriazole and 4-nitroquinoline resistance protein
2.0                 YMR303   ADH2      Alcohol dehydrogenase II
3.7                 YBR145   ADH5      Alcohol dehydrogenase V
         3.2        YBL030   AAC2      ADP, ATP carrier protein 2
         2.9        YBR085   AAC3      ADP, ATP carrier protein
         2.7        YDR298   ATP5      H+-transporting ATP synthase δ chain precursor
         2.5        YBR039   ATP3      H+-transporting ATP synthase γ chain precursor
         5.5        YML054   CYB2      Lactate dehydrogenase cytochrome b2
         3.4        YML054   CYB2      Lactate dehydrogenase cytochrome b2
         2.3        YKL150   MCR1      Cytochrome-b5 reductase
         4.2        YBL045   COR1      Ubiquinol-cytochrome c reductase 44K core protein
         3.5        YDL067   COX9      Cytochrome c oxidase chain VIIA
         2.7        YLR038   COX12     Cytochrome c oxidase, subunit VIB
         2.6        YHR051   COX6      Cytochrome c oxidase subunit VI
         2.4        YLR395   COX8      Cytochrome c oxidase chain VIII
         2.3        YFR033   QCR6      Ubiquinol-cytochrome c reductase 17K protein
         23.7       YLR081   GAL2      Galactose (and glucose) permease
         21.9       YBR018   GAL7      UDP-glucose-hexose-1-phosphate uridylyltransferase
         21.8       YBR020   GAL1      Galactokinase
         19.5       YBR019   GAL10     UDP-glucose 4-epimerase
         14.7       YLR081   GAL2      Galactose (and glucose) permease
         8.6        YDR009   GAL3      Galactokinase
         3.0        YML051   GAL80(1)  Negative regulator for expression of galactose-induced genes
         2.8        YML051   GAL80(2)  Negative regulator for expression of galactose-induced genes
2.7                 YER055   HIS1      ATP phosphoribosyltransferase
3.4                 YBR248   HIS7      Glutamine amidotransferase/cyclase
7.4                 YCL030   HIS4      Phosphoribosyl-AMP cyclohydrolase/phosphoribosyl-ATP pyrophosphatase/histidinol dehydrogenase
5.8                 YKR080   MTD1      Methylenetetrahydrofolate dehydrogenase (NAD+)
6.0                 YDR019   GCV1      Glycine decarboxylase T subunit
6.1                 YLR058   SHM2      Serine hydroxymethyltransferase
         8.1        YML123   PHO84     High-affinity inorganic phosphate/H+ symporter
3.5                 YDR408   ADE8      Phosphoribosylglycinamide formyltransferase (GART)
3.6                 YDR408   ADE8      Phosphoribosylglycinamide formyltransferase (GART)
4.4                 YAR015   ADE1      Phosphoribosylamidoimidazole-succinocarboxamide synthase
5.6                 YMR300   ADE4      Amidophosphoribosyltransferase
5.6                 YOR128   ADE2      Phosphoribosylaminoimidazole carboxylase
6.0                 YGL234   ADE5,7    Phosphoribosylamine-glycine ligase and phosphoribosylformylglycinamidine cyclo-ligase
         6.3        YBL015   ACH1      Acetyl-CoA hydrolase


Heat Shock Results. A log phase culture growing in YEP/
dextrose medium at 25°C was split in half. One half of the
culture remained at 25°C whereas the other half of the culture
was shifted to 39°C. mRNA was isolated from both cultures 1 h
after heat shock for comparison on microarrays and, although
this time point is not optimal for measuring induction of heat
shock mRNAs (17), many known heat shock genes exhibited
considerable induction at this time point (Table 1; Fig. 2).
Down-regulation of genes in the ribosomal protein and histone
gene categories was also observed. Differential expression
between the heat-shocked culture and the control was also
observed for many other genes. Genes in many categories, such
as amino acid catabolism and amino acid synthesis, exhibited
a mixed response with some genes showing little or no
differential expression and other genes showing a significant
increase or decrease in gene expression in response to heat
shock (Table 1; Fig. 2).

Cold Shock Results. A log phase culture growing in YEP/
dextrose medium at 37°C was split in half. One half of the
culture remained at 37°C while the other half of the culture was
shifted to 18°C. mRNA was isolated from both cultures 1 h
after cold shock for comparison on microarrays. As expected,
two known cold shock genes (TIP1, TIR1) were expressed at
a significantly higher level in the cold-shocked culture. Genes
in other functional categories, such as glucose metabolism and
heat shock, displayed a mixed response with expression of some
genes being unaffected and other genes exhibiting significant
up- or down-regulation in response to cold shock (Table 2).

Steady-State Galactose vs. Glucose Results. mRNA was
isolated from steady-state log phase YEP galactose and YEP
glucose grown cultures for comparison on the microarrays. As
expected, the GAL genes were expressed at a much higher
level in the galactose culture. Many genes were differentially
expressed in these cultures that were not a priori expected to
exhibit differential expression. For example, some genes in the
amino acid catabolic category were up-regulated in the
galactose culture whereas genes in the one-carbon metabolism and
purine categories were largely or entirely down-regulated in
the galactose culture (Table 3). Genes in other categories, such
as amino acid synthesis, ABC transporter, cytochrome c, and
cytochrome b, exhibited mixed responses; some genes in a
category showed little or no obvious differential expression
whereas other genes in the same category showed significant
differential expression in the galactose and glucose cultures.

DISCUSSION

The results of these experiments show that many genes are
differentially expressed under the three environmental
conditions described here. The expected and predicted changes in gene
expression, such as HSP12 in the heat-shocked culture, TIP1 in
the cold-shocked culture, and GAL2 in the steady-state galactose
culture, were observed in every case. However, in addition to the
expected changes in gene expression, significant differential
expression was also observed for many other genes that would
not, a priori, be expected to be differentially expressed. For
example, expression of PHO11 decreased and expression of
YLR194, KIN2, and HXT6 increased in the heat-shocked culture.
Expression of MST1 and APE3 decreased and expression of
PDR5 and GAR1 increased in the cold-shocked culture. In
addition, ADE4 and SER2 were expressed at reduced levels
whereas PHO84 and ACH1 were expressed at higher levels in
cells grown in galactose compared with cells grown in glucose. 
Differential expression of these and many other genes was specific 
to one of these three environmental conditions. 

Many other genes were found to be differentially expressed 
under more than one condition. When differentially expressed 
genes in cold- and heat-shocked cultures were compared, 30 
genes were found in common. Of these 30 genes, 28 showed 
inverse expression (i.e., increased expression under one condition 
and decreased expression under the other condition). Two genes, 
YCR058 and YKL102, showed elevated expression in response to 
both cold and heat shock. Fifteen genes were found to be 
differentially expressed in both the heat-shocked and steady-state 
galactose cultures: 9 genes showed increased expression and 5 
showed decreased expression under both conditions. Twenty 
genes were differentially expressed in both the cold-shocked and 
steady-state galactose cultures: 8 genes showed decreased expres- 
sion and 5 genes showed increased expression under both con- 
ditions. Six genes showed increased expression in the galactose 
culture and decreased expression in the cold shocked culture. 
One gene (ODP1) showed increased expression in both the 
cold-shocked and steady-state galactose cultures. 
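
The cross-condition comparison described in this paragraph amounts to intersecting two sets of differentially expressed genes and sorting the shared genes by whether their direction of change agrees. The sketch below illustrates this with a small subset of genes; the directions (+1 for higher expression in the treated culture, -1 for lower) are read from Tables 1 and 2 and, for YCR058 and YKL102, from the statement above that both were elevated under heat and cold shock. It is not the authors' analysis code, and the function name is hypothetical.

```python
def classify_overlap(condition_a, condition_b):
    """Split genes shared by two conditions into (inverse, concordant) lists.

    Each argument maps a gene name to +1 (increased in the treated
    culture) or -1 (decreased in the treated culture).
    """
    shared = sorted(set(condition_a) & set(condition_b))
    inverse = [g for g in shared if condition_a[g] != condition_b[g]]
    concordant = [g for g in shared if condition_a[g] == condition_b[g]]
    return inverse, concordant


# Subset of genes only; directions taken from Tables 1 and 2 and the text.
heat = {"HSP12": +1, "HSP26": +1, "HSP30": +1, "YRO2": +1, "HXT6": +1,
        "RP10A": -1, "SUP44": -1, "YCR058": +1, "YKL102": +1}
cold = {"TIP1": +1, "HSP26": -1, "HSP30": -1, "YRO2": -1, "HXT6": -1,
        "RP10A": +1, "SUP44": +1, "YCR058": +1, "YKL102": +1}

inverse, concordant = classify_overlap(heat, cold)
print(inverse)
print(concordant)
```

Genes unique to one condition (HSP12 and TIP1 in this subset) drop out of the intersection, which matches the distinction drawn in the text between condition-specific and shared responses.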

Gene expression is affected in a global fashion when environ- 
mental conditions are changed and both expected and unex- 
pected genes are affected. There is also overlap in the genes that 
are differentially expressed under quite different environmental 
conditions. These results can be rationalized by considering the 
high degree of cross-pathway regulation in yeast. For example, 
there is evidence for cross-pathway regulation between (i) carbon
and nitrogen metabolism (18), (ii) phosphate and sulfate
metabolism (19), and (iii) purine, phosphate, and amino acid
metabolism (20-24). There are also examples of the interaction of
general and specific transcription factors (25, 26). Finally, within 
the broad class of amino acid biosynthetic genes, there is evidence 
for amino acid specific regulation of some genes, regulation via 
general control for other genes, and regulation via both specific 
and general control for other genes (22, 27-30). 

Cross-pathway regulation arises from the complex structure 
of promoters. Virtually all promoters contain sites for multiple 
transcription factors and, therefore, virtually all genes are 
subject to combinatorial regulation. For example, the HIS4 
promoter contains binding sites for GCN4 (the general amino 
acid control transcription factor), PHO2/BAS2 (a transcriptional
regulator of phosphatase and purine biosynthetic
genes), and BAS1 (a transcriptional regulator of purine
biosynthetic genes) (31). It is likely that the complex effects on
gene expression described in this work are a direct conse- 
quence of the combinatorial regulation of gene expression. 

These findings illustrate the power of the highly parallel whole 
genome approach when examining gene expression. The global 
effects of environmental change on gene expression can now be 
directly visualized. It is clear that determining the mechanism(s) 
and the functional role of the dramatic global effects on gene
expression in different environments will be a significant
challenge. The era of whole genome analysis will, ultimately, allow
researchers to switch from the very focused single gene/promoter 
view of gene expression and instead view the cell more as a large 
complex network of gene regulatory pathways. 

With the entire sequence of this model organism known, new 
approaches have been developed that allow for genome wide 
analyses (32, 33) of gene function. The genome microarrays 
represent a novel tool for genetic and expression analysis of the 
yeast genome. This pilot study uses arrays containing >35% of 
the yeast ORFs and it is clear that the entire set of ORFs from 
the yeast genome can be arrayed using the directed primer based 
strategy detailed here. Recent advances in arraying technology 
will allow all 6,100 ORFs to be arrayed in an area of less than 1.8
cm². Furthermore, as the technology improves, detection limits
will allow less than 500 ng of starting mRNA material to be used 
for making probe. 

The genome arrays provide for a robust, fully automated 
approach toward examining genome structure and gene func- 
tion. They allow for comparisons between different genomes 
as well as a detailed study of gene expression at the global level. 
This research will help to elucidate relationships between 
genes and allow the researcher to understand gene function by 
understanding expression patterns across the yeast genome. 

Support was provided by National Institutes of Health Grant 
PO/HG00205. 
Exploring the Metabolic and Genetic Control of
Gene Expression on a Genomic Scale

Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown*

DNA microarrays containing virtually every gene of Saccharomyces cerevisiae were used 
to carry out a comprehensive investigation of the temporal program of gene expression 
accompanying the metabolic shift from fermentation to respiration. The expression 
profiles observed for genes with known metabolic functions pointed to features of the 
metabolic reprogramming that occur during the diauxic shift, and the expression patterns 
of many previously uncharacterized genes provided clues to their possible functions. The 
same DNA microarrays were also used to identify genes whose expression was affected 
by deletion of the transcriptional co-repressor TUP1 or overexpression of the transcrip- 
tional activator YAP1. These results demonstrate the feasibility and utility of this ap- 
proach to genomewide exploration of gene expression patterns. 



The complete sequences of nearly a dozen 
microbial genomes are known, and in the 
next several years we expect to know the 
complete genome sequences of several 
metazoans, including the human genome. 
Defining the role of each gene in these 
genomes will be a formidable task, and un- 
derstanding how the genome functions as a 
whole in the complex natural history of a 
living organism presents an even greater 
challenge. 

Knowing when and where a gene is 
expressed often provides a strong clue as to 
its biological role. Conversely, the pattern 
of genes expressed in a cell can provide 
detailed information about its state. Al- 
though regulation of protein abundance in 
a cell is by no means accomplished solely 
by regulation of mRNA, virtually all dif- 
ferences in cell type or state are correlated 
with changes in the mRNA levels of many 
genes. This is fortuitous because the only
specific reagent required to measure the
abundance of the mRNA for a specific
gene is a cDNA sequence. DNA microarrays,
consisting of thousands of individual
gene sequences printed in a high-density
array on a glass microscope slide (1, 2),
provide a practical and economical tool 
for studying gene expression on a very 
large scale (3-6). 

Saccharomyces cerevisiae is an especially 
favorable organism in which to conduct a
systematic investigation of gene expression.
The genes are easy to recognize in the
genome sequence, cis regulatory elements are
generally compact and close to the tran- 
scription units, much is already known 
about its genetic regulatory mechanisms, 
and a powerful set of tools is available for its 
analysis. 

A recurring cycle in the natural history 
of yeast involves a shift from anaerobic 
(fermentation) to aerobic (respiration) me- 
tabolism. Inoculation of yeast into a medi- 
um rich in sugar is followed by rapid growth 
fueled by fermentation, with the production 
of ethanol. When the fermentable sugar is 
exhausted, the yeast cells turn to ethanol as 
a carbon source for aerobic growth. This 
switch from anaerobic growth to aerobic 
respiration upon depletion of glucose, re- 
ferred to as the diauxic shift, is correlated 
with widespread changes in the expression 
of genes involved in fundamental cellular 
processes such as carbon metabolism, pro- 
tein synthesis, and carbohydrate storage 
(7). We used DNA microarrays to charac- 
terize the changes in gene expression that 
take place during this process for nearly the 
entire genome, and to investigate the ge- 
netic circuitry that regulates and executes 
this program. 

Yeast open reading frames (ORFs) were 
amplified by the polymerase chain reaction 
(PCR), with a commercially available set of 
primer pairs (8). DNA microarrays, con- 
taining approximately 6400 distinct DNA 
sequences, were printed onto glass slides by 



using a simple robotic printing device (9). 
Cells from an exponentially growing culture 
of yeast were inoculated into fresh medium 
and grown at 30°C for 21 hours. After an 
initial 9 hours of growth, samples were har- 
vested at seven successive 2-hour intervals, 
and mRNA was isolated (10). Fluorescently 
labeled cDNA was prepared by reverse transcription
in the presence of Cy3(green)-
or Cy5(red)-labeled deoxyuridine triphosphate
(dUTP) (11) and then hybridized to
the microarrays (12). To maximize the re- 
liability with which changes in expression 
levels could be discerned, we labeled cDNA 
prepared from cells at each successive time 
point with Cy5, then mixed it with a Cy3- 
labeled "reference" cDNA sample prepared 
from cells harvested at the first interval 
after inoculation. In this experimental de- 
sign, the relative fluorescence intensity 
measured for the Cy3 and Cy5 fluors at 
each array element provides a reliable mea- 
sure of the relative abundance of the corre- 
sponding mRNA in the two cell popula- 
tions (Fig. 1). Data from the series of seven 
samples (Fig. 2), consisting of more than 
43,000 expression-ratio measurements, 
were organized into a database to facilitate 
efficient exploration and analysis of the 
results. This database is publicly available 
on the Internet (13). 
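
The two-color measurement described above reduces, for each array element, to a ratio of two fluorescence intensities. The following Python sketch illustrates that arithmetic only; it is not the authors' analysis software. The ORF labels, the intensity values, and the `floor` parameter are hypothetical and chosen for illustration.

```python
import math


def expression_ratio(cy5, cy3, floor=1.0):
    """Return the Cy5/Cy3 intensity ratio for one array element.

    cy5: intensity of the time-point sample (red channel).
    cy3: intensity of the reference sample (green channel).
    floor: hypothetical lower bound that keeps near-zero intensities
           from producing unstable ratios (an assumption, not from the text).
    """
    return max(cy5, floor) / max(cy3, floor)


def log2_ratio(cy5, cy3, floor=1.0):
    """Log2 of the ratio: +1 is a twofold induction, -1 a twofold decrease."""
    return math.log2(expression_ratio(cy5, cy3, floor))


# Hypothetical intensities: (Cy5 time-point sample, Cy3 reference sample)
spots = {
    "ORF_A": (5200.0, 1300.0),
    "ORF_B": (800.0, 820.0),
    "ORF_C": (150.0, 1200.0),
}

ratios = {orf: expression_ratio(cy5, cy3) for orf, (cy5, cy3) in spots.items()}
```

Working on the log scale treats a fourfold induction and a fourfold repression symmetrically, which is one common reason to prefer it when ratios are compared or averaged.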

During exponential growth in glucose- 
rich medium, the global pattern of gene 
expression was remarkably stable. Indeed, 
when gene expression patterns between the 
first two cell samples (harvested at a 2-hour 
interval) were compared, mRNA levels differed
by a factor of 2 or more for only 19
genes (0.3%), and the largest of these differences
was only 2.7-fold (14). However, as
glucose was progressively depleted from the 
growth media during the course of the ex- 
periment, a marked change was seen in the 
global pattern of gene expression. mRNA 
levels for approximately 710 genes were 
induced by a factor of at least 2, and the 
mRNA levels for approximately 1030 genes 
declined by a factor of at least 2. Messenger 
RNA levels for 183 genes increased by a 
factor of at least 4, and mRNA levels for 
203 genes diminished by a factor of at least 
4. About half of these differentially ex- 
pressed genes have no currently recognized 
function and are not yet named. Indeed, 
more than 400 of the differentially ex- 
pressed genes have no apparent homology 
to any gene whose function is known (15).
The responses of these previously uncharacterized
genes to the diauxic shift therefore
provide the first small clue to their possible
roles. 
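
Counts such as "induced by a factor of at least 2" can be reproduced from a table of expression ratios with a simple threshold rule. The sketch below is illustrative only. It assumes that a gene counts as induced if its ratio reaches the threshold at any time point, and as repressed if its ratio falls to the reciprocal of the threshold at any time point; the text does not state the exact rule the authors applied, and the ORF labels and ratio values are hypothetical.

```python
def classify(ratios_by_gene, fold=2.0):
    """Split genes into induced and repressed sets at a given fold threshold.

    ratios_by_gene maps a gene name to its expression ratios
    (sample / reference) across the time series.
    """
    induced = {g for g, r in ratios_by_gene.items() if max(r) >= fold}
    repressed = {g for g, r in ratios_by_gene.items() if min(r) <= 1.0 / fold}
    return induced, repressed


# Hypothetical seven-point time series for three ORFs
series = {
    "ORF_A": [1.0, 1.1, 1.3, 2.5, 4.8, 9.2, 9.9],
    "ORF_B": [1.0, 0.9, 0.8, 0.4, 0.2, 0.2, 0.1],
    "ORF_C": [1.0, 1.0, 1.1, 0.9, 1.2, 1.1, 1.0],
}

induced2, repressed2 = classify(series, fold=2.0)
induced4, repressed4 = classify(series, fold=4.0)
```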

The global view of changes in expres- 
sion of genes with known functions pro- 
vides a vivid picture of the way in which 
the cell adapts to a changing environ- 
ment. Figure 3 shows a portion of the yeast 
metabolic pathways involved in carbon 
and energy metabolism. Mapping the 
changes we observed in the mRNAs en- 
coding each enzyme onto this framework 
allowed us to infer the redirection in the 
flow of metabolites through this system. 
We observed large inductions of the genes 
coding for the enzymes aldehyde dehydrogenase
(ALD2) and acetyl-coenzyme
A (CoA) synthase (ACS1), which function
together to convert the products of
alcohol dehydrogenase into acetyl-CoA, 
which in turn is used to fuel the tricarbox- 
ylic acid (TCA) cycle and the glyoxylate 
cycle. The concomitant shutdown of tran- 
scription of the genes encoding pyruvate 
decarboxylase and induction of pyruvate 
carboxylase rechannels pyruvate away 
from acetaldehyde, and instead to oxalac- 
etate, where it can serve to supply the 
TCA cycle and gluconeogenesis. Induction
of the pivotal genes PCK1, encoding
phosphoenolpyruvate carboxykinase, and
FBP1, encoding fructose 1,6-biphosphatase,
switches the directions of two key
irreversible steps in glycolysis, reversing
the flow of metabolites along the reversible
steps of the glycolytic pathway toward
the essential biosynthetic precursor,
glucose-6-phosphate. Induction of the genes
coding for the trehalose synthase and gly- 
cogen synthase complexes promotes chan- 
neling of glucose-6-phosphate into these 
carbohydrate storage pathways. 

Just as the changes in expression of 
genes encoding pivotal enzymes can pro- 
vide insight into metabolic reprogram- 
ming, the behavior of large groups of func- 
tionally related genes can provide a broad 
view of the systematic way in which the 
yeast cell adapts to a changing environ- 
ment (Fig. 4). Several classes of genes, 
such as cytochrome c-related genes and 
those involved in the TCA/glyoxylate cycle
and carbohydrate storage, were coordinately
induced by glucose exhaustion. In
contrast, genes devoted to protein synthe- 
sis, including ribosomal proteins, tRNA 
synthetases, and translation, elongation, 
and initiation factors, exhibited a coordi- 
nated decrease in expression. More than 
95% of ribosomal genes showed at least 
twofold decreases in expression during the 
diauxic shift (Fig. 4) (13). A noteworthy 
and illuminating exception was that the 



genes encoding mitochondrial ribosomal
proteins were generally induced rather than
repressed after glucose limitation, highlighting
the requirement for mitochondrial
biogenesis (13). As more is learned about
the functions of every gene in the yeast 
genome, the ability to gain insight into a 
cell's response to a changing environment 
through its global gene expression patterns 
will become increasingly powerful. 

Several distinct temporal patterns of ex- 
pression could be recognized, and sets of 
genes could be grouped on the basis of the 
similarities in their expression patterns. The 
characterized members of each of these 
groups also shared important similarities in 
their functions. Moreover, in most cases, 
common regulatory mechanisms could be 
inferred for sets of genes with similar expres- 
sion profiles. For example, seven genes 
showed a late induction profile, with mRNA 
levels increasing by more than ninefold at 



the last timepoint but less than threefold at 
the preceding timepoint (Fig. 5B). All of 
these genes were known to be glucose-re- 
pressed, and five of the seven were previously 
noted to share a common upstream activat- 
ing sequence (UAS), the carbon source re- 
sponse element (CSRE) (16-20). A search 
in the promoter regions of the remaining two 
genes, ACR1 and IDP2, revealed that
ACR1, a gene essential for ACS1 activity,
also possessed a consensus CSRE motif, but 
interestingly, IDP2 did not. A search of the 
entire yeast genome sequence for the con- 
sensus CSRE motif revealed only four addi- 
tional candidate genes, none of which 
showed a similar induction. 

Examples from additional groups of 
genes that shared expression profiles are 
illustrated in Fig. 5, C through F. The 
sequences upstream of the named genes in 
Fig. 5C all contain stress response ele- 
ments (STRE), and with the exception 




Fig. 1. Yeast genome microarray. The actual size of the microarray is 18 mm by 18 mm. The 
microarray was printed as described (9). This image was obtained with the same fluorescent 
scanning confocal microscope used to collect all the data we report (49). A fluorescently labeled
cDNA probe was prepared from mRNA isolated from cells harvested shortly after inoculation (culture
density of <5 x 10^6 cells/ml and media glucose level of 19 g/liter) by reverse transcription in the
presence of Cy3-dUTP. Similarly, a second probe was prepared from mRNA isolated from cells taken
from the same culture 9.5 hours later (culture density of ~2 x 10^8 cells/ml, with a glucose level of
<0.2 g/liter) by reverse transcription in the presence of Cy5-dUTP. In this image, hybridization of the
Cy3-dUTP-labeled cDNA (that is, mRNA expression at the initial timepoint) is represented as a green
signal, and hybridization of Cy5-dUTP-labeled cDNA (that is, mRNA expression at 9.5 hours) is
represented as a red signal. Thus, genes induced or repressed after the diauxic shift appear in this 
image as red and green spots, respectively. Genes expressed at roughly equal levels before and after 
the diauxic shift appear in this image as yellow spots. 
of HSP42, have previously been shown to
be controlled at least in part by these 
elements (21-24). Inspection of the se- 
quences upstream of HSP42 and the two 
uncharacterized genes shown in Fig. 5C, 
YKL026c, a hypothetical protein with 
similarity to glutathione peroxidase, and 
YGR043c, a putative transaldolase, revealed
that each of these genes also possesses
repeated upstream copies of the stress-
responsive CCCCT motif. Of the 13 ad- 
ditional genes in the yeast genome that 
shared this expression profile [including 
HSP30, ALD2, OM45, and 10 uncharac- 
terized ORFs (25)], nine contained one or 
more recognizable STRE sites in their up- 
stream regions. 

The heterotrimeric transcriptional activator
complex HAP2,3,4 has been shown
to be responsible for induction of several 
genes important for respiration (26-28). 
This complex binds a degenerate consensus 
sequence known as the CCAAT box (26). 
Computer analysis, using the consensus se- 
quence TNRYTGGB (29), has suggested 
that a large number of genes involved in 
respiration may be specific targets of 
HAP2,3,4 (30). Indeed, a putative
HAP2,3,4 binding site could be found in 
the sequences upstream of each of the seven 
cytochrome c-related genes that showed 
the greatest magnitude of induction (Fig. 
5D). Of 12 additional cytochrome c-related 
genes that were induced, HAP2,3,4 binding
sites were present in all but one. Signifi- 
cantly, we found that transcription of 
HAP4 itself was induced nearly ninefold 
concomitant with the diauxic shift. 
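
A degenerate consensus such as TNRYTGGB can be searched for by expanding each IUPAC ambiguity code into a character class. The Python sketch below shows one way to do this with the standard `re` module. It is an illustration, not the computer analysis cited in the text. Scanning both strands is an assumption, and the test sequence is invented.

```python
import re

# Standard IUPAC nucleotide ambiguity codes expressed as regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def consensus_to_regex(consensus):
    """Compile a consensus into a regex; the lookahead permits overlapping hits."""
    body = "".join(IUPAC[base] for base in consensus.upper())
    return re.compile("(?=(" + body + "))")


def find_sites(seq, consensus):
    """Return sorted (position, strand) hits, with 0-based forward-strand positions."""
    seq = seq.upper()
    pattern = consensus_to_regex(consensus)
    width = len(consensus)
    hits = [(m.start(), "+") for m in pattern.finditer(seq)]
    rc = reverse_complement(seq)
    # Convert reverse-strand match positions back to forward-strand coordinates
    hits += [(len(seq) - m.start() - width, "-") for m in pattern.finditer(rc)]
    return sorted(hits)


# Invented sequence containing one forward-strand match (TCACTGGT) at position 2
forward_hits = find_sites("AATCACTGGTAA", "TNRYTGGB")
reverse_hits = find_sites(reverse_complement("AATCACTGGTAA"), "TNRYTGGB")
```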

Control of ribosomal protein biogenesis 
is mainly exerted at the transcriptional 
level, through the presence of a common 
upstream-activating element (UASrpg)
that is recognized by the Rap1 DNA-binding
protein (31, 32). The expression profiles
of seven ribosomal proteins are shown
in Fig. 5F. A search of the sequences
upstream of all seven genes revealed consensus
Rap1-binding motifs (33). It has
been suggested that declining Rap1 levels
in the cell during starvation may be responsible
for the decline in ribosomal protein
gene expression (34). Indeed, we observed
that the abundance of RAP1
mRNA diminished by 4.4-fold, at about 
the time of glucose exhaustion. 

Of the 149 genes that encode known or 
putative transcription factors, only two, 
HAP4 and SIP4, were induced by a factor of
more than threefold at the diauxic shift.
SIP4 encodes a DNA-binding transcriptional
activator that has been shown to
interact with Snf1, the "master regulator" of
glucose repression (35). The eightfold in- 
duction of SIP4 upon depletion of glucose 
strongly suggests a role in the induction of 



downstream genes at the diauxic shift. 

Although most of the transcriptional 
responses that we observed were not pre- 
viously known, the responses of many 
genes during the diauxic shift have been 
described. Comparison of the results we 
obtained by DNA microarray hybridiza- 
tion with previously reported results there- 
fore provided a strong test of the sensitiv- 
ity and accuracy of this approach. The 
expression patterns we observed for previ- 
ously characterized genes showed almost 
perfect concordance with previously pub- 
lished results (36). Moreover, the differ- 
ential expression measurements obtained 
by DNA microarray hybridization were re- 
producible in duplicate experiments. For 
example, the remarkable changes in gene 
expression between cells harvested imme- 
diately after inoculation and immediately 
after the diauxic shift (the first and sixth 
intervals in this time series) were mea- 
sured in duplicate, independent DNA mi- 
croarray hybridizations. The correlation 
coefficient for two complete sets of expres- 
sion ratio measurements was 0.87, and for 
more than 95% of the genes, the expres- 



sion ratios measured in these duplicate 
experiments differed by less than a factor 
of 2. However, in a few cases, there were 
discrepancies between our results and pre- 
vious results, pointing to technical limita- 
tions that will need to be addressed as 
DNA microarray technology advances 
(37, 38). Despite the noted exceptions, 
the high concordance between the results 
we obtained in these experiments and 
those of previous studies provides confi- 
dence in the reliability and thoroughness 
of the survey. 
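
The two reproducibility figures quoted above, a correlation coefficient between duplicate hybridizations and the fraction of genes whose duplicate ratios agree within a factor of 2, can be computed as in the following Python sketch. The replicate values are hypothetical. Computing the correlation on log2-transformed ratios is an assumption made here; the text does not say whether raw or log ratios were used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fraction_within_factor(r1, r2, factor=2.0):
    """Fraction of genes whose two ratio measurements differ by less than `factor`."""
    agree = sum(1 for a, b in zip(r1, r2) if max(a, b) / min(a, b) < factor)
    return agree / len(r1)


# Hypothetical expression ratios for five genes in two duplicate hybridizations
rep1 = [4.0, 0.5, 1.0, 8.0, 0.25]
rep2 = [3.5, 0.6, 1.1, 3.0, 0.3]

log_rep1 = [math.log2(r) for r in rep1]
log_rep2 = [math.log2(r) for r in rep2]

r = pearson(log_rep1, log_rep2)
frac = fraction_within_factor(rep1, rep2)
```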

The changes in gene expression during 
this diauxic shift are complex and involve 
integration of many kinds of information 
about the nutritional and metabolic state 
of the cell. The large number of genes 
whose expression is altered and the diver- 
sity of temporal expression profiles ob- 
served in this experiment highlight the 
challenge of understanding the underlying 
regulatory mechanisms. One approach to 
defining the contributions of individual 
regulatory genes to a complex program of 
this kind is to use DNA microarrays to 
identify genes whose expression is affected 



Fig. 2. The section of the array indicated by the gray box
in Fig. 1 is shown for each of the experiments described
here. Representative genes are labeled. In each of the arrays
used to analyze gene expression during the diauxic
shift, red spots represent genes that were induced relative
to the initial timepoint, and green spots represent
genes that were repressed relative to the initial timepoint.
In the arrays used to analyze the effects of the tup1Δ mutation
and YAP1 overexpression, red spots represent
genes whose expression was increased, and green spots
represent genes whose expression was decreased by
the genetic modification. Note that distinct sets of genes are
induced and repressed in the different experiments. The
complete images of each of these arrays can be viewed on
the Internet (13). Cell density as measured by optical density
(OD) at 600 nm was used to measure the growth of the
culture. [Panel labels: Growth OD 0.14, 0.46, 0.8, 1.8, 3.7, 6.9.]
by mutations in each putative regulatory
gene. As a test of this strategy, we analyzed 
the genomewide changes in gene expression 
that result from deletion of the TUP1 gene.
Transcriptional repression of many genes by 
glucose requires the DNA-binding repressor 



Mig1 and is mediated by recruiting the transcriptional
co-repressors Tup1 and Cyc8/
Ssn6 (39). Tup1 has also been implicated in
repression of oxygen-regulated, mating-type-
specific, and DNA-damage-inducible genes 
(40). 
Fig. 3. Metabolic reprogramming inferred from global analysis of changes in gene expression. Only key
metabolic intermediates are identified. The yeast genes encoding the enzymes that catalyze each step 
in this metabolic circuit are identified by name in the boxes. The genes encoding succinyl-CoA synthase 
and glycogen-debranching enzyme have not been explicitly identified, but the ORFs YGR244 and
YPR184 show significant homology to known succinyl-CoA synthase and glycogen-debranching enzymes,
respectively, and are therefore included in the corresponding steps in this figure. Red boxes with
white lettering identify genes whose expression increases in the diauxic shift. Green boxes with dark 
green lettering identify genes whose expression diminishes in the diauxic shift. The magnitude of 
induction or repression is indicated for these genes. For multimeric enzyme complexes, such as 
succinate dehydrogenase, the indicated fold-induction represents an unweighted average of all the 
genes listed in the box. Black and white boxes indicate no significant differential expression (less than
twofold). The direction of the arrows connecting reversible enzymatic steps indicates the direction of the
flow of metabolic intermediates, inferred from the gene expression pattern, after the diauxic shift. Arrows
representing steps catalyzed by genes whose expression was strongly induced are highlighted in red. 
The broad gray arrows represent major increases in the flow of metabolites after the diauxic shift, 
inferred from the indicated changes in gene expression. 



Wild-type yeast cells and cells bearing
a deletion of the TUP1 gene (tup1Δ) were
grown in parallel cultures in rich medium
containing glucose as the carbon source.
Messenger RNA was isolated from exponentially
growing cells from the two populations
and used to prepare cDNA labeled
with Cy3 (green) and Cy5 (red),
respectively (11). The labeled probes were
mixed and simultaneously hybridized to
the microarray. Red spots on the microarray
therefore represented genes whose
transcription was induced in the tup1Δ
strain, and thus presumably repressed by
Tup1 (41). A representative section of the
microarray (Fig. 2, bottom middle panel)
illustrates that the genes whose expression
was affected by the tup1Δ mutation were,
in general, distinct from those induced
upon glucose exhaustion [complete images
of all the arrays shown in Fig. 2 are available
on the Internet (13)]. Nevertheless,
34 (10%) of the genes that were induced
by a factor of at least 2 after the diauxic
shift were similarly induced by deletion of
TUP1, suggesting that these genes may be
subject to TUP1-mediated repression by
glucose. For example, SUC2, the gene encoding
invertase, and all five hexose transporter
genes that were induced during the
course of the diauxic shift were similarly
induced, in duplicate experiments, by the
deletion of TUP1.

The set of genes affected by Tup1 in this
experiment also included α-glucosidases,
the mating-type-specific genes MFA1 and
MFA2, and the DNA damage-inducible
RNR2 and RNR4, as well as genes involved
in flocculation and many genes of unknown
function. The hybridization signal corresponding
to expression of TUP1 itself was
also severely reduced because of the (incomplete)
deletion of the transcription unit
in the tup1Δ strain, providing a positive
control in the experiment (42).

Many of the transcriptional targets of
Tup1 fell into sets of genes with related
biochemical functions. For instance, although
only about 3% of all yeast genes
appeared to be TUP1-repressed by a factor
of more than 2 in duplicate experiments
under these conditions, 6 of the 13 genes
that have been implicated in flocculation
(15) showed a reproducible increase in
expression of at least twofold when TUP1
was deleted. Another group of related
genes that appeared to be subject to TUP1
repression encodes the serine-rich cell
wall mannoproteins, such as Tip1 and
Tir1/Srp1, which are induced by cold
shock and other stresses (43), and similar,
serine-poor proteins, the seripauperins
(44). Messenger RNA levels for 23 of the
26 genes in this group were reproducibly
elevated by at least 2.5-fold in the tup1Δ


www.sciencemag.org • SCIENCE • VOL. 278 • 24 OCTOBER 1997 






strain, and 18 of these genes were induced 
by more than sevenfold when TUP1 was 
deleted. In contrast, none of 83 genes that 
could be classified as putative regulators of 
the cell division cycle were induced more 
than twofold by deletion of TUP1. Thus,
despite the diversity of the regulatory systems
that employ Tup1, most of the genes
that it regulates under these conditions 
fall into a limited number of distinct func- 
tional classes. 

Because the microarray allows us to 
monitor expression of nearly every gene in 
yeast, we can, in principle, use this ap- 
proach to identify all the transcriptional 
targets of a regulatory protein like Tup1. It
is important to note, however, that in any
single experiment of this kind we can only
recognize those target genes that are normally
repressed (or induced) under the
conditions of the experiment. For instance,
the experiment described here analyzed
a MATα strain in which MFA1
and MFA2, the genes encoding the a-factor
mating pheromone precursor, are
normally repressed. In the isogenic tup1Δ
strain, these genes were inappropriately
expressed, reflecting the role that Tup1
plays in their repression. Had we instead
carried out this experiment with a MATa
strain (in which expression of MFA1 and
MFA2 is not repressed), it would not have
been possible to conclude anything regarding
the role of Tup1 in the repression
of these genes. Conversely, we cannot distinguish
indirect effects of the chronic
absence of Tup1 in the mutant strain from
effects directly attributable to its partici- 
pation in repressing the transcription of a 
gene. 

Another simple route to modulating the 
activity of a regulatory factor is to overex- 
press the gene that encodes it. YAP1 en- 
codes a DNA-binding transcription factor 
belonging to the b-zip class of DNA-binding
proteins. Overexpression of YAP1 in
yeast confers increased resistance to hydrogen
peroxide, o-phenanthroline, heavy
metals, and osmotic stress (45). We analyzed
differential gene expression between a
wild-type strain bearing a control plasmid
and a strain with a plasmid expressing YAP1
under the control of the strong GAL1-10
promoter, both grown in galactose (that is,
a condition that induces YAP1 overexpression).
Complementary DNA from the control
and YAP1-overexpressing strains, labeled
with Cy3 and Cy5, respectively, was
prepared from mRNA isolated from the two 
strains and hybridized to the microarray. 
Thus, red spots on the array represent genes 
that were induced in the strain overexpress- 
ing YAP1. 

Of the 17 genes whose mRNA levels 
increased by more than threefold when 



YAP1 was overexpressed in this way, five 
bear homology to aryl-alcohol oxidoreduc- 
tases (Fig. 2 and Table 1). An additional 
four of the genes in this set also belong to 
the general class of dehydrogenases/oxi- 
doreductases. Very little is known about 
the role of aryl-alcohol oxidoreductases in 
S. cerevisiae, but these enzymes have been 
isolated from ligninolytic fungi, in which
they participate in coupled redox reactions,
oxidizing aromatic and aliphatic
unsaturated alcohols to aldehydes with the
production of hydrogen peroxide (46, 47). 
The fact that a remarkable fraction of the 
targets identified in this experiment be- 
long to the same small, functional group of 
oxidoreductases suggests that these genes 



might play an important protective role 
during oxidative stress. Transcription of a 
small number of genes was reduced in the 
strain overexpressing Yap1. Interestingly,
many of these genes encode sugar per- 
meases or enzymes involved in inositol 
metabolism. 

We searched for Yap1-binding sites
(TTACTAA or TGACTAA) in the sequences
upstream of the target genes we
identified (48). About two-thirds of the
genes that were induced by more than
threefold upon Yap1 overexpression had
one or more binding sites within 600 bases
upstream of the start codon (Table 1), suggesting
that they are directly regulated by
Yap1. The absence of canonical Yap1-bind-



Fig. 4. Coordinated regulation of functionally related genes. The curves
represent the average induction or repression ratios for all the genes in
each indicated group. The total number of genes in each group was
as follows: ribosomal proteins, 112; translation elongation and initiation
factors, 25; tRNA synthetases (excluding mitochondrial synthetases), 17; glycogen and trehalose synthesis
and degradation, 15; cytochrome c oxidase and reductase proteins, 19; and TCA- and glyoxylate-cycle
enzymes, 24. [Horizontal axis: Time (hours).]

Table 1. Genes induced by YAP1 overexpression. This list includes all the genes for which mRNA levels
increased by more than twofold upon YAP1 overexpression in both of two duplicate experiments, and
for which the average increase in mRNA level in the two experiments was greater than threefold (50).
Positions of the canonical Yap1 binding sites upstream of the start codon, when present, and the
average fold-increase in mRNA levels measured in the two experiments are indicated.
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YNL331C 






Putative aryl-alcohol reductase 


12.9 
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162-222 (5 sites) 




Similarity to bacterial csgA protein 


10.4 


YML007W 


YAP1 


Transcriptional activator involved in 
oxidative stress response 
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223, 242 




Homology to aryl-alcohol 
dehydrogenases 


9.0 


YUL060C 
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Putative glutathione transferase 
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Putative aryl-alcohol dehydrogenase 
(NADP+) 
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resistance protein 
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enzyme), isoform 3 
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YMR251W 
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enzyme), isoform 1 
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MDH2 


Malate dehydrogenase 
ing sites upstream of the others may reflect
an ability of Yap1 to bind sites that differ
from the canonical binding sites, perhaps in
cooperation with other factors, or, less likely,
may represent an indirect effect of Yap1
overexpression, mediated by one or more
intermediary factors. Yap1 sites were found
only four times in the corresponding region
of an arbitrary set of 30 genes that were not
differentially regulated by Yap1.
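
The search for canonical sites within 600 bases upstream of the start codon can be sketched in Python as follows. This is a minimal illustration and not the authors' procedure. The function name `yap1_sites`, the convention that the input sequence ends immediately before the start codon, the scanning of both strands, and the example sequence are all assumptions introduced here.

```python
SITES = ("TTACTAA", "TGACTAA")  # canonical sites quoted in the text
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def yap1_sites(upstream, window=600):
    """Find canonical sites in the last `window` bases of `upstream`.

    upstream: sequence written 5'->3', ending immediately before the
              start codon (an assumed convention).
    Returns sorted (distance, strand, site) tuples, where distance is the
    number of bases from the 5'-most base of the match to the start codon.
    """
    region = upstream.upper()[-window:]
    n = len(region)
    hits = []
    for site in SITES:
        for motif, strand in ((site, "+"), (revcomp(site), "-")):
            start = region.find(motif)
            while start != -1:
                hits.append((n - start, strand, site))
                start = region.find(motif, start + 1)
    return sorted(hits)


# Invented 14-base upstream sequence with one forward-strand site
example_hits = yap1_sites("GGGTTACTAACCCC")
```

Reporting positions as distances from the start codon mirrors the way site positions are described in the caption of Table 1, although the exact coordinate convention used there is not stated.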

Use of a DNA microarray to character- 
ize the transcriptional consequences of 
mutations affecting the activity of regula- 
tory molecules provides a simple and pow- 
erful approach to dissection and character- 
ization of regulatory pathways and net- 



works. This strategy also has an important 
practical application in drug screening. 
Mutations in specific genes encoding can- 
didate drug targets can serve as surrogates 
for the ideal chemical inhibitor or modu- 
lator of their activity. DNA microarrays 
can be used to define the resulting signa- 
ture pattern of alterations in gene expres- 
sion, and then subsequently used in an 
assay to screen for compounds that repro- 
duce the desired signature pattern. 

DNA microarrays provide a simple and 
economical way to explore gene expres- 
sion patterns on a genomic scale. The 
hurdles to extending this approach to any 
other organism are minor. The equipment 



required for fabricating and using DNA 
microarrays (9) consists of components 
that were chosen for their modest cost and 
simplicity. It was feasible for a small group 
to accomplish the amplification of more 
than 6000 genes in about 4 months and, 
once the amplified gene sequences were in 
hand, only 2 days were required to print a 
set of 110 microarrays of 6400 elements
each. Probe preparation, hybridization, 
and fluorescent imaging are also simple 
procedures. Even conceptually simple ex- 
periments, as we described here, can yield 
vast amounts of information. The value of 
the information from each experiment of 
this kind will progressively increase as 
more is learned about the functions of 
each gene and as additional experiments 
define the global changes in gene expres- 
sion in diverse other natural processes and 
genetic perturbations. Perhaps the greatest 
challenge now is to develop efficient 
methods for organizing, distributing, inter- 
preting, and extracting insights from the 
large volumes of data these experiments 
will provide. 
Human Genome Placed on Chip; Biotech Rivals Put It Up
for Sale 



By ANDREW POLLACK (NYT) 1030 words 
The genome on a chip has arrived. 

Melding high technology with biology, several companies are rushing to sell slivers of glass or 
nylon, some as small as postage stamps, packed with pieces of all 30,000 or so known human 
genes. 

The new products will allow scientists to scan all genes in a human tissue sample at once, to 
determine which genes are active, a job that previously required two or more chips. The whole- 
genome chips will lower the cost and increase the speed of a widely used test that has 
transformed biomedical research in the last few years. 

"It's sort of a milestone event, very similar to generating an integrated circuit of the genome," 
said Stephen P. A. Fodor, the chief executive of Affymetrix Inc., the leading seller of gene chips, 
which are also called microarrays. 

Affymetrix, based in Santa Clara, Calif., is expected to announce today that it is accepting orders
for its whole-genome chip. 

The announcement seems timed to steal some thunder from the rival Agilent Technologies, 
which is based in nearby Palo Alto. Agilent is to be the host of an analyst meeting today and it 
plans to announce then that it has started shipping test versions of its whole-genome chip. 

Applied Biosystems of Foster City, Calif., a unit of the Applera Corporation, started the race in 
July with an announcement that it would have a whole-genome chip out by the end of this year. 
NimbleGen Systems, a small company in Madison, Wis., announced a few days later that it had a 
genome on a chip that it was not selling but that it was using to run tests for customers. 

Gene chips, which detect genes that are active, meaning they are being used to make a protein, 
have become essential tools. Scientists try to understand the genetic mechanisms of disease by 
seeing which genes are turned on in, say, a sick kidney or lung compared with those active in a 
healthy organ. Pharmaceutical companies look at gene activity patterns to try to predict the 
effects of drugs. 



Scientists have found that tumors that look the same under the microscope can differ in terms of 
which genes are active. So studying gene patterns could become a way to discriminate between 
deadly and not-so-deadly tumors, or to predict which drug will work best for a particular patient. 

Still, even some vendors conceded that the change from two chips to one is more symbolic than 
revolutionary. 

"You can do just as good science with two chips, it costs you a little more," said Roland Green, 
the vice president for research and development at NimbleGen. 

Some scientists questioned whether the chips really have all human genes, because the exact 
number and identities of all the genes is not known. 

The advent of the genome on a chip is, however, evidence that biotechnology, to the extent that it 
uses electronics, is experiencing some of the rapid progress that has made semiconductors and 
computers continuously cheaper and smaller.

"One of the effects everyone is looking for in the genomics area is Moore's law — more data, less
money," said Doug Dolginow, an executive vice president at Gene Logic, which sells data from 
gene chip studies to pharmaceutical companies. "This is a step in that direction." 

Moore's law states that the number of transistors on a semiconductor chip doubles every 18 
months. 
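
As a rough illustration of the doubling rule just quoted, the sketch below computes the growth factor implied by an 18-month doubling period. The function name and the 18-month default are illustrative assumptions taken from the sentence above, not from any vendor's data.

```python
def transistor_growth(years: float, doubling_months: float = 18.0) -> float:
    """Growth factor after `years`, assuming one doubling every `doubling_months`."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


# Three years is two doubling periods, so the count quadruples.
print(transistor_growth(3.0))  # 4.0
```

Under this assumption the count quadruples every three years, which is the pace the genomics executives are comparing their chips against.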

Affymetrix's gene chips are, in fact, made with the same techniques used to make semiconductor 
chips. In the mid-1990's, the company came out with a set of five chips covering what was then 
known of the human genome. After the human genome sequence was virtually completed in 
2000, the company developed a two-chip set with all the known genes. Now it has the single 
chip, which some scientists say will be more convenient. 

"We like to be able to look at all genes at one time to get a global view of what's going on," said 
John R. Walker, who runs gene chip operations at the Genomics Institute of the Novartis 
Research Foundation in San Diego. 

Costs should also be lower. Gene chips have been so expensive that many academic scientists 
still make their own rather than buy them. Affymetrix said it would sell its whole-genome chips 
for $300 to $500 each, depending on volume, little more than half the price of the two-chip set. 
The other companies have not announced prices. 

For Affymetrix, a successful whole-genome chip "is essential for them to maintain their 
dominance" of high-end microarrays, said Edward A. Tenthoff, an analyst at U.S. Bancorp Piper 
Jaffray. Affymetrix had total product sales in 2002 of about $250 million, and a company 
spokesman said that human genome chips are its top-selling product. 

Mr. Tenthoff, who recommends Affymetrix stock, said the company's sales growth rate had 
moderated as it faces tougher competition. Agilent, a spinoff of Hewlett-Packard that makes its 
gene chips by printing DNA components onto glass slides using ink jet printers, has gained 
share, he said. Applied Biosystems, the largest maker of genomics equipment over all, will be
entering the microarray segment of the business with its whole-genome chip, emphasizing the
connection of that product to the others it offers, including the gene database developed by its 
sister company, Celera Genomics. 

Jeffrey Trent, scientific director of the Translational Genomics Research Institute in Phoenix, 
said that while whole-genome chips are useful for medical discovery, the biggest growth of the 
market will be for chips that can be used by doctors to do diagnoses. And whole-genome chips 
are too cumbersome for that, he said. Rather, once scientists use the whole-genome chips to find 
particular genes that are associated with, say, tumor aggressiveness or drug effectiveness, he 
said, they will then make smaller and cheaper chips containing just those genes for use in 
diagnosis.

News@Agilent

Agilent Technologies ships whole human genome on single 
microarray to gene expression customers for evaluation 

Company to introduce first commercial whole human microarray by end of year 
PALO ALTO, Calif., Oct. 2, 2003




Agilent Technologies Inc. (NYSE: A) today announced it has shipped whole human-genome microarrays 
to customers for testing and evaluation. The whole genome microarray is based on Agilent's new double- 
density format, which can accommodate 44,000 features on a single 1" x 3" glass-slide microarray. The 
new platform enables drug-discovery and disease researchers to perform whole-genome screening at a 
lower cost and with higher reproducibility. 

"This is an important step toward our release of the first whole human-genome microarray product, which
is expected to be available for order before the end of the year," said Barney Saunders, vice president
and general manager of Agilent's BioResearch Solutions Unit. "Customers have long wanted a one-sample,
one-chip format with the increased sensitivity associated with 60-mer probes. The cost savings
and high-quality performance make this product a compelling alternative for scientists who make their
own microarrays."

Agilent's microarrays are based on the industry-standard 1" x 3" (25mm x 75mm) format, which is 
compatible with most commercial microarray scanners. All Agilent commercial microarrays are developed 
using content from public databases and proprietary sources, with full sequence and annotation 
information made available to customers. Gene sequences for probes are developed using algorithms 
and then validated empirically through iterative wet-lab testing procedures. The result is a microarray 
comprised of functionally validated probes, with the most up-to-date and comprehensive genome 
information commercially available. 

Advantages of the double-density format include: 



• Lower cost. Not only is one microarray less expensive than two, it requires fewer reagents and 
reduces instrumentation demands. 

• Streamlined workflow. Researchers need prepare and process only one microarray instead of 
two. This also results in fewer steps in the subsequent data analysis. 

• Greater reproducibility. Use of a single microarray further reduces unnecessary variability in 
experimental conditions. 

• Smaller sample use. A smaller quantity of sample material is required to perform an experiment.

Availability

Agilent's Whole Human Genome Microarray is expected to be available for order by the end of the year.

About Agilent Technologies

Agilent Technologies Inc. (NYSE: A) is a global technology leader in communications, electronics, life 
sciences and chemical analysis. The company's 30,000 employees serve customers in more than 110 
countries. Agilent had net revenue of $6 billion in fiscal year 2002. Information about Agilent is available
on the Web at www.agilent.com.



Forward-Looking Statements 

This news release contains forward-looking statements (including, without limitation, statements relating 
to Agilent's expectation that its whole-genome microarray platform will be available for order before the 
end of 2003) that involve risks and uncertainties that could cause results to differ materially from
management's current expectations. These and other risks are detailed in the company's filings with the 
Securities and Exchange Commission, including its Annual Report on Form 10-K for the year ended Oct. 
31, 2002, its Quarterly Report on Form 10-Q for the quarter ended July 31, 2003 and its Current Report 
on Form 8-K filed Aug. 18, 2003. The company assumes no obligation to update the information in this
press release.

Affymetrix Announces Commercial Launch of Single Array for Human Genome
Expression Analysis 




[Photo: Affymetrix GeneChip(R) Brand Human Genome U133 Plus 2.0 Array. (PRNewsFoto)]

SANTA CLARA, CA USA 10/02/2003



More Than 1 Million Probes Analyze Expression Levels of Nearly 50,000 RNA 
Transcripts and Variants on a Single Array the Size of a Thumbnail 

SANTA CLARA, Calif., Oct. 2 /PRNewswire/ -- Affymetrix, Inc.,
(Nasdaq: AFFX) announced today that it is taking orders for its new 
GeneChip(R) brand Human Genome U133 Plus 2.0 Array, offering researchers the 
protein-coding content of the human genome on a single commercially available 
catalog microarray. The HG-U133 Plus 2.0 Array analyzes the expression level 
of nearly 50,000 RNA transcripts and variants with 22 different probes per 
transcript, providing superior data quality unmatched by technologies using a 
single probe per transcript. 

(Photo: http://www.newscom.com/cgi-bin/prnh/20031002/SFTH021)

"With about 1.3 million probes on a chip the size of a human thumbnail, 
the Human Plus Array represents a leap in array technology data capacity, and 
further demonstrates the unique power and potential of our technology to 
explore vast areas of the genome," said Trevor J. Nicholls, Ph.D., Chief 
Commercial Officer. "Multiple independent measurements for each transcript 
ensure that our data quality remains the industry standard, even as our data 
capacity increases dramatically." 

The HG-U133 Plus 2.0 Array, which will ship in October, combines the 
content of the previous HG-U133 two-array set with nearly 10,000 new probe 
sets representing about 6,500 new genes, for a total of nearly 50,000 RNA 
transcripts and variants. This new information, verified against the latest 
version of the publicly available genome map, provides researchers the most 
comprehensive and up-to-date genome-wide gene expression analysis. The probe 
design strategy of the HG-U133 Plus 2.0 Array is identical to the previous
HG-U133 Set, providing very strong data concordance between the two products.
With more than double the data capacity of the previous-generation Affymetrix 
human product, the HG-U133 Plus 2.0 Array can significantly cut processing and 
analysis time for scientists in the lab, freeing up valuable resources and 
accelerating research. 

The HG-U133 Plus 2.0 Array sets a new standard for the number of genes and 
transcripts on any commercially available single array for human gene
expression analysis, while maintaining Affymetrix' unrivaled data quality. The
HG-U133 Plus 2.0 Array uses 22 independent measures to detect the 
hybridization of each transcript on the array, 1.3 million data points in all, 
more than 30 times that of any other microarray technology. Using multiple,
independent measurements provides optimal sensitivity and specificity, and the
most accurate, consistent and statistically significant results possible.

"More data points produce more reliable results and ultimately, enable 
better science," said Nicholls. "Our powerful probe set strategy gives our 
customers the assurance that their array results actually reflect what's in 
their sample." 

Affymetrix is also launching an updated 11-micron version of its popular 
18-micron HG-U133A Array called the GeneChip HG-U133A 2.0 Array. The reduced
feature size on this new design means researchers can use smaller sample 
volumes than on the previous 18-micron array without compromising performance. 
This new array represents over 20,000 transcripts that can be used to explore 
human biology and disease processes. All probe sets represented on the 
original GeneChip HG-U133A Array are identically replicated on the GeneChip 
HG-U133A 2.0 Array. 

More information on the design of the HG-U133 Plus 2.0 Array and the 
HG-U133A 2.0 Array may be found on the Affymetrix website at 
http://www.affymetrix.com.

Affymetrix will be presenting further information on this and other 
products at the BioTechnica trade show in Hanover, Germany on Oct. 7-9, 2003. 
The Company will also hold a press conference on Oct. 7, from 11 a.m. to 
12 p.m. at the show regarding the new Human Genome U133 Plus 2.0 Array. If you 
would like to attend this press conference, please contact Caroline Stupnicka 
at c.stupnicka@northbankcommunications.com. 

About Affymetrix: 

Affymetrix is a pioneer in creating breakthrough tools that are driving 
the genomic revolution. By applying the principles of semiconductor technology 
to the life sciences, Affymetrix develops and commercializes systems that 
enable scientists to improve the quality of life. The Company's customers 
include pharmaceutical, biotechnology, agrichemical, diagnostics and consumer
products companies as well as academic, government and other non-profit 
research institutes. Affymetrix offers an expanding portfolio of integrated 
products and services, including its integrated GeneChip platform, to address 
growing markets focused on understanding the relationship between genes and 
human health. Additional information on Affymetrix can be found at 
http://www.affymetrix.com . 

All statements in this press release that are not historical are
"forward-looking statements" within the meaning of Section 21E of the
Securities Exchange Act as amended, including statements regarding Affymetrix'
"expectations," "beliefs," "hopes," "intentions," "strategies" or the like. 
Such statements are subject to risks and uncertainties that could cause actual 
results to differ materially for Affymetrix from those projected, including, 
but not limited to risks of the Company's ability to achieve and sustain 
higher levels of revenue, higher gross margins, reduced operating expenses, 
uncertainties relating to technological approaches, manufacturing, product 
development, market acceptance (including uncertainties relating to product 
development and market acceptance of the GeneChip HG-U133 Human Plus 2.0 Array 
and the HG-U133A 2.0), personnel retention, uncertainties related to cost and 
pricing of Affymetrix products, dependence on collaborative partners, 
uncertainties relating to sole source suppliers, uncertainties relating to FDA 
and other regulatory approvals, competition, risks relating to intellectual 
property of others and the uncertainties of patent protection and litigation. 
These and other risk factors are discussed in Affymetrix' Form 10-K for the
year ended December 31, 2002 and other SEC reports, including its Quarterly
Reports on Form 10-Q for subsequent quarterly periods. Affymetrix expressly 
disclaims any obligation or undertaking to release publicly any updates or 
revisions to any forward-looking statements contained herein to reflect any 
change in Affymetrix' expectations with regard thereto or any change in
events, conditions, or circumstances on which any such statements are based. 

NOTE: Affymetrix, the Affymetrix logo, and GeneChip are registered
trademarks owned or used by Affymetrix, Inc. 



SOURCE Affymetrix, Inc. 

Web Site: http://www.affymetrix.com 

Photo Notes: NewsCom: 

http://www.newscom.com/cgi-bin/prnh/20031002/SFTH021 AP Archive: 
http://photoarchive.ap.org PRN Photo Desk, 
photodesk@prnewswire.com

The third enactment of Cambridge
Healthtech Institute's Macroresults 
through Microarrays meeting was held 
in Boston (MA, USA) from 29 April- 
1 May 2002. The subtheme of this year's 
meeting was 'advancing drug discov- 
ery', a widely touted application for 
array technology. 

The evolution of microarrays 

If you were asked 'Who first conceived 
of the idea of microarrays', who would 
come to mind? Mark Schena perhaps, 
first author of the seminal 1995 paper 
on cDNA arrays [1]? Maybe Pat Brown, 
Schena's then supervisor? Or perhaps 
Stephen Fodor, the primary driver 
behind Affymetrix's (http://www.affymetrix.com)
oligonucleotide-based platform [2]. Brits might
even chant the name of Ed Southern [3]. Well,
according to Roger Ekins (University College
London Medical School;
http://www.ucl.ac.uk/medicine/) all these answers
would be wrong. It was in fact Ekins 
and his colleagues who first conceived 
of and patented 'a new generation of 
ultrasensitive, miniaturized assays for 
protein and DNA-RNA measurement 
based on the use of microarrays' in the 
mid 1980s [4]. The concept and poten- 
tial of array technology was more fully 
described in a later publication, in 
which Ekins et al. [5] concluded that
antibody microspots of ~50 µm² could be
achieved, and that as many as 2 million
different immunoassays could, in
principle, be accommodated on a surface
area of 1 cm².
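
The two figures quoted from Ekins and colleagues are consistent with each other, which a short unit conversion makes explicit. The sketch below is only an illustrative check: it assumes the spots tile the surface with no gaps between them, and the function name is hypothetical.

```python
# 1 cm = 10,000 µm, so 1 cm² = 10,000² = 1e8 µm².
UM2_PER_CM2 = 10_000 ** 2


def max_spots(area_cm2: float, spot_area_um2: float) -> int:
    """Upper bound on spot count, assuming spots tile the surface with no gaps."""
    return int(area_cm2 * UM2_PER_CM2 // spot_area_um2)


# Microspots of about 50 µm² on a 1 cm² surface.
print(max_spots(1.0, 50.0))  # 2000000
```

Any real array needs spacing between spots, so 2 million per square centimetre is a ceiling rather than a practical density.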

Technological innovation 

In practice, it took a different biological
molecule (DNA), a different research
group, and a leap into microfabrication
technology to even begin
approaching these kinds of densities
[Affymetrix patent 6045996 talks of
one million spots cm⁻²]. Of course,
advancing technology is one of the 
driving engines behind the genomics 
juggernaut, and we are already seeing 
'4th generation' machines for fab- 
ricating DNA chips. If the company 
representatives at this meeting are to 
be believed (and their cases seemed 
strong), spotting is out, and in situ 
fabrication of oligonucleotide-based 
'iterative custom arrays' is in. Whether 
you go with Combimatrix's (http://www.combimatrix.com)
electrochemically directed synthesis and detection
system, febit's (http://www.febit.com)
Geniom® technology, or NimbleGen's
(http://www.nimblegen.com) Maskless 
Array Synthesizer technology is a 
matter of personal choice. However, 
each of these machines provides the 
flexibility to design variable length 
oligonucleotide probes from se- 
quences inputted by the user, and then 
perform in situ synthesis of an array. 
Each system also boasts unique advan- 
tages. For example, Combimatrix's 
biological array processor is a semi- 
conductor coated with a 3D layer 
of porous material in which DNA, 
RNA, peptides or small molecules 
can be synthesized or immobilized 
within discrete test sites, while febit's 
Geniom One® is a fully integrated
gene-expression analysis system with 
minimal user hands-on time - the 
probe sequences are programmed, the 
RNA samples inserted, and the gene 
expression data is pumped out a few 
hours later. 



Cell- and tissue-based arrays

Array technology is in most people's
minds firmly linked with gene-expression 
profiling. Fewer are aware that cell- and 
tissue-based arrays have been devel- 
oped, and how they can provide 
a vital extra dimension to research. In 
support of this, Barry Bochner gave an 
update on the cell-based array system 
that Biolog (http://www.biolog.com) 
has produced for simultaneously mea- 
suring the effects of one gene in the cell 
under thousands of growth conditions 
(see [6] for further details). David Walt 
(Tufts University; http://www.tufts.edu/)
is developing single live cell arrays
using optical imaging fiber (OIF)
technology. An array of microwells is 
fabricated on the face of an OIF at densities
of up to 10 million wells cm⁻².
Cells are then added to the wells and 
disperse at an average of one cell per 
well. Physiological and genetic re- 
sponses of each cell are measured via 
fluorescence produced by reporter 
genes (e.g. lacZ, gfp). Assays performed
so far include a yeast live or dead cell
assay, microenvironment pH and
O₂ measurements, promoter responses
using the lacZ and phoA reporter genes,
and protein-protein interactions using 
the yeast two-hybrid system. The main 
advantage of this system is that the cells 
remain alive during the assay, which 
means a real-time timecourse can be 
performed and/or the array passed 
from sample to sample. This would be 
useful in, for example, the scanning of 
a combinatorial drug library for specific 
physiological effects. 

Tissue arrays are a useful complemen- 
tary technology to DNA arrays because 
they can be used to help validate and
understand the biological and medical
significance of gene changes discov- 
ered using standard DNA arrays. For 
example, an array of tumor tissues can 
be screened for the protein (using im- 
munohistochemistry), message (using 
in situ hybridization) and copy number 
(using comparative genomic hybridiza- 
tion) of a gene of interest, to determine 
if expression of the gene (or lack 
thereof) is related in any way to sur- 
vival. They can also be used to predict 
the probability of clinical failure of lead 
compounds as a result of toxicity by 
evaluating the distribution of the drug 
targets in normal tissue. Spyro Mousses 
and his co-workers at the National 
Human Genome Research Institute 
(http://www.nhgri.nih.gov/index.html) 
have built such arrays, including a 
multi-tumor array (~5000 specimens,
and sections from 36 normal and 800 
metastatic tissues) and a normal tissue 
array (76 tissue and 332 cell types). 

The problem with proteins 

It has been said that genomics tells us 
what might happen, transcriptomics 
indicates what should happen, and pro- 
teomics shows what is happening. The 
impact of functional proteomics on 
pharmaceutical R&D is rapidly increas- 
ing, and protein arrays are being used 
increasingly in both basic and applied 
research. Their use lies not only in com- 
parative protein expression and inter- 
action profiling, but also in diagnostics 
and drug discovery. However, an in- 
creasing number of researchers have 
found that protein arrays, like their 
cousins the DNA arrays, present several 
practical obstacles relating to their pro- 
duction and use. For example, in using 
Escherichia coli to produce recombi- 
nant eukaryotic proteins from a single 
expression vector, multiple protein 
products are often produced, suggest- 
ing mixes of truncated or otherwise 
altered proteins. There is also the obvi- 
ous concern that the proteins might 
not be modified in a similar manner to
eukaryotic systems. Also, an optimal
method for depositing and binding 
proteins to the selected substrate is 
yet to be determined, as is the best 
way to ensure that they are bound in a 
correctly folded, active conformation. 

Several companies have been addressing
these problems. Prolinx (http://www.prolinxinc.com)
is one such company, and Karin Hughes described their
Versalinx™ chemistry for producing 
protein, peptide and small-molecule 
arrays. Versalinx™ uses solution-phase 
conjugation followed by immobiliza- 
tion, resulting in functional orientation 
of proteins and peptides on the sub- 
strate surface. It also offers the valuable 
additional benefit of exhibiting low 
non-specific binding. Sense Proteomic 
(http://www.senseproteomic.com) is 
also among those addressing these 
problems to develop robust protein 
arrays for drug discovery and clinical 
applications and has developed func- 
tional protein array formats based on 
specific disease tissues. Subtractive hy- 
bridization is used to identify genes 
with altered expression in breast tumor 
and cystic fibrosis compared to normal 
tissue. A high throughput cloning strat- 
egy (COVET™) is then used to produce 
libraries of genes that are tagged, 
cloned, expressed, purified and finally 
immobilized on glass slides. Initial vali- 
dation studies have shown that the vast 
majority of the immobilized proteins do 
indeed retain biological function. 

Stefan Schmidt and his company 
(GPC Biotech; http://www.gpcbiotech.de)
have moved past the platform development
stage and, with their focus
firmly on drug discovery, are currently 
developing kinase-profiling arrays. 
Kinases are important targets for phar- 
maceutical drug discovery and therapy, 
and GPC's aim is to simultaneously detect
multiple kinases, obtain activity profiles
for different cell types, or analyze
the ability of drug candidates to inhibit 
kinase activity. To do this, recombinant 
kinase substrates are immobilized on
membranes, incubated with purified
kinase, and the substrates measured for 
the degree of phosphorylation. 

Summary 

Meetings like this, packed with exciting 
discoveries and intriguing and interest- 
ing innovation, heavily emphasize the 
pace at which biotechnology is advanc- 
ing, to the extent that the number of 
options for genomic and proteomic re- 
searchers can become overwhelming. 
Although data analysis is perhaps the 
greatest current concern for array users, 
an increasing challenge will be to deter- 
mine the approaches and technology 
that really work, and to do it in a timely 
manner.

A standard two-dimensional (2-D) protein map of Fischer 344 rat liver
(F344MST3) is presented, with a tabular listing of more than 1200 protein species. 
Sodium dodecyl sulfate (SDS) molecular mass and isoelectric point have been
established, based on positions of numerous internal standards. This map has been
used to connect and compare hundreds of 2-D gels of rat liver samples from a
variety of studies, and forms the nucleus of an expanding database describing rat
liver proteins and their regulation by various drugs and toxic agents. An example
of such a study, involving regulation of cholesterol synthesis by cholesterol-lowering
drugs and a high-cholesterol diet, is presented. Since the map has been
obtained with a widely used and highly reproducible 2-D gel system (the Iso-Dalt®
system), it can be directly related to an expanding body of work in other
laboratories.

1 Introduction

High-resolution two-dimensional electrophoresis of proteins,
introduced in 1975 by O'Farrell and others [1-4], has
been used over the ensuing 16 years to examine a wide
variety of biological systems, the results appearing in more
than 5000 published papers. With the advent of computerized
systems for analyzing two-dimensional (2-D) gel images
and constructing spot databases, it is also possible to
plan and assemble integrated bodies of information de- 
scribing the appearance and regulation of thousands of pro- 
tein gene products [5, 6]. Creating such databases involves 
amassing and organizing quantitative data from thousands 
of 2-D gels, and requires a substantial commitment in tech- 
nology and resources. 

Given the long-term effort required to develop a protein
database, the choice of a biological system takes on
considerable importance. While in vitro systems are ideal for
answering many experimental questions, especially in cancer
research and genetics, our experience with cell cultures and
tissue samples suggests that some in vivo approaches could
have major advantages. In particular, we have noticed that
liver tissue samples from rats and mice appear to show
greater quantitative reproducibility (in terms of individual
protein expression) than replicate cell cultures. This is perhaps
a natural result of the homeostasis maintained in a
complete animal vs. the well-known variability of cell cultures,
the latter due principally to differences in reagents (e.g.,
fetal bovine serum), conditions (e.g., pH) and genetic
"evolution" of cell lines while in culture. It is also more difficult
to generate adequate amounts of protein from cell culture
systems (particularly with attached cells), forcing the
investigator to resort to radioisotope-based or silver-based
stain-detection methods. While these methods are more
sensitive (sometimes much more sensitive) than the Coomassie
Brilliant Blue (CBB) stain typically used for protein
detection in "large" protein samples, they are generally more
variable, more labor-intensive and, in the case of radiographic
methods, may generate highly "noisy" images, due to the
properties of the films used. By contrast, large protein
samples can easily be prepared from liver using urea/Nonidet
P-40 (NP-40) solubilization and stained with CBB, which
has the advantage of being easily reproducible [8]. Finally,
there remains the question of the "truthfulness" of many in
vitro systems as compared to their in vivo analogs; how
great are the changes caused by the introduction into a
culture and the associated shift to strong selection for growth,
and how do these affect experimental outcomes? Hence
the apparent advantages of in vitro systems, in terms of
experimental manipulation, may be counterbalanced by
other factors relating to 2-D data quality.

There is a second important class of reasons for exploring 
the use of an in vivo biological system such as the liver. His- 
torically there have been two broad approaches to the me- 
chanistic dissection of biochemical processes in intact cel- 
lular systems: genetics (a search for informative mutants) 
and the use of chemical agents (drugs and chemical toxins). 
Both approaches help us to understand complex systems 
by disrupting some specific functional element and showing
us the result. With the development of techniques for
genetic manipulation and cloning, the genetic approach 
can be effectively applied either in vitro or in vivo, although 
the in vitro route is usually quicker. The chemical approach 
can also be applied to either sort of biological system; here, 
however, the bulk of consistently acquired information is 
in experimental animals (rats and mice). While most biolo- 
gists know a short list of compounds having specific, experi- 
mentally useful effects (e.g., inhibitors of protein synthesis, 
ionophores, polymerase inhibitors, channel blockers, nu- 
cleotide analogs, and compounds affecting polymerization 
of cytoskeletal proteins), there is a much larger number of 
interesting chemically-induced effects, most of them char- 
acterized by toxicologists and pharmacologists in rodent 
systems. Just as a thorough genetic analysis would involve 
saturating a genome with mutations, it is possible to ima- 
gine a saturating number of drugs, the analysis of whose ac- 
tions would reveal the complete biochemistry of the cell. 
While organized drug discovery efforts usually target spe- 
cific desired effects, the nature of the process, with its de- 
pendence on screening large numbers of compounds, ne- 
cessarily produces many unanticipated effects. It is there- 
fore reasonable to suppose that the required broad range of 
compounds necessary to achieve "biochemical saturation" 
may be forthcoming; in fact, it may already exist among the 
hundreds of thousands of compounds that failed to qualify 
as drugs. 

Among organs, the liver is an obvious choice for the study 
of chemical effects because of its well-known plasticity and
responsiveness. The brain appears to be quite plastic (e.g. 
[7]), but it is a complicated mixture of cell types requiring 
skillful dissection for most experiments. The kidney, while 
quite responsive, also presents a potentially confounding 
mixture of cell types. The liver, by contrast, is made up of 
one predominant cell type which is easy to solubilize: the 
hepatocyte, representing more than 95% of its mass. Most 
importantly, the liver performs many homeostatic functions
that require rapid modulation of gene expression. It
appears that most chemical agents tested affect gene expression
in the liver at some dosage (N. Leigh Anderson,
unpublished observations), an interesting contrast to our 
earlier work with lymphocytes, for example, which seem to 
be much less responsive. Such results conform to the expec- 
tation that cells with a homeostatic, physiological role 
should be more plastic than cells differentiated for a pur- 
pose dependent on the action of a limited number of spe- 
cific genes. 

The liver also allows the parallels between in vitro and in
vivo systems to be examined in detail. Significant progress
has been made in the development of mouse, rat and
human hepatocyte culture systems, as well as in precision-cut
tissue slices. Using such an array of techniques, it is
possible to assemble a matrix of mammalian systems including
mouse and rat in vivo on one level and mouse, rat and
human in vitro on a second level, and to compare effects
between species and between systems. This approach allows
us to draw informed conclusions regarding the biochemical
"universality" of biological responses among the mammals,
and to offer some insight into the validity of in vitro
approaches for toxicological screening. We believe this data
will be necessary if in vitro alternatives are to achieve wide
usage in government-mandated safety testing of drugs,
consumer products and industrial and agricultural chemicals.

A number of interesting studies have been published using 
2-D mapping to examine effects in the rodent liver. A num- 
ber of investigarors have made use of the technique to 
screen for existing genetic variants 18—1 1] or induced muta- 
tions [12-14], mainly in the mouse. This work builds on the 
wealth of genetic information available on the mouse and 
its established position as a mammalian mutation-detec- 
tion system. While some studies of chemical effects have 
been undertaken in the mouse [15-17], most have used the 
rat [18-23]. The examination of the cytochrome P-450 system,
in particular, has been carried out almost exclusively
on the rat [24, 25]. 

These considerations lead us to conclude that rodent liver 
offers the best opportunity to systematically examine an 
array of gene regulation systems, and ultimately to build a 
predictive model of large-scale mammalian gene control. 
The basic underlying foundation of such a project is a reli- 
able, reproducible master 2-D pattern of liver, to which on- 
going experimental results can be referred. In this paper, we 
report such a master pattern for the acidic and neutral proteins
of rat liver (pattern F344MST3). In future, this master
will be supplemented by maps of basic proteins, and analogous
maps of mouse and human liver.



2 Materials and methods 
2.1 Sample preparation 

Liver is an ideal sample material for most biochemical stud- 
ies, including 2-D analysis. A sample is taken of approxima- 
tely 0.5 g of tissue from the apical end of the left lobe of the 
liver. Solubilization is effected as rapidly as practical; a 
delay of 5-15 min appears to cause no major alteration in 
liver protein composition if the liver pieces are kept cold 
(e.g., on ice) in the interim. In the solubilization process, 
the liver sample is weighed, placed in a glass homogenizer 
(e.g., 15 mL Wheaton); 8 volumes of solubilizing solution*
is added (i.e., 4 mL per 0.5 g tissue) and the mixture is
homogenized using first the loose- and then the tight-fitting
glass pestle. This takes approximately 5 strokes with
each pestle and is carried out at room temperature because
urea would crystallize out in the cold. Once the liver sample
is thoroughly homogenized in the solubilizer, it is assumed
that all the proteins are denatured (by the chaotropic effect
of the urea and NP-40 detergent) and the enzymes inactivated
by the high pH (~9.5). Therefore these samples may
be kept at room temperature until they can be centrifuged
or frozen as a group (within several hours of preparation).
The samples are centrifuged for 6 × 10^6 g·min (e.g., 500 000
× g for 12 min using a Beckman TL-100 centrifuge). The
centrifuge rotor is maintained at just below room temperature
(e.g., 15-20 °C), but not too cold, so as to prevent the
precipitation of urea. The centrifuge of choice is a Beckman
TL-100 because of the sample tube sizes available, but any
ultracentrifuge accepting smallish tubes will suffice. When
an appropriate centrifuge is not available near the site of
sample preparation, samples can be frozen at -80 °C and
thawed prior to centrifugation and collection of supernatants.
Each supernatant is carefully removed following centrifugation
and aliquoted into at least 4 clean tubes for storage.
This is done by transferring all the supernatant to one
clean tube, mixing this gently (to assure homogeneous
composition) and then dividing it into 4 aliquots. The aliquots
are frozen immediately at -80 °C. These multiple aliquots
can provide insurance against a failed run or a freezer
breakdown.

* The solubilizing solution is composed of 2% NP-40 (Sigma), 9 M urea
(analytical grade, e.g., BDH or Bio-Rad), 0.5% dithiothreitol (DTT;
Sigma) and 2% carrier ampholytes (pH 9-11 LKB; these come as a 20%
stock solution, so 2% final concentration is achieved by making the final
solution 10% 9-11 Ampholine by volume). A large batch of solubilizer
(several hundred mL) is made and stored frozen at -80 °C in aliquots
sufficient to provide enough for one day's estimated sample preparation
requirement. The solution is never allowed to become warmer
than room temperature at any stage during preparation or thawing for
use, since heating of concentrated urea solutions can produce contaminants
that covalently modify proteins, producing artifactual charge
shifts. Once thawed, any unused solubilizer is discarded.
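The quantities in this protocol reduce to two small calculations: the solubilizer volume for a given tissue weight, and the spin time needed at a given centrifugal force to reach the stated 6 × 10^6 g·min. The following is a minimal sketch of that arithmetic only; the function names are hypothetical, and treating 1 g of tissue as one "volume" of 1 mL is an assumption implied by the 4 mL per 0.5 g example.

```python
def solubilizer_volume_ml(tissue_g, volumes_per_g=8.0):
    """Volume of solubilizing solution for a tissue sample.

    Assumes 1 g of tissue counts as 1 mL ("one volume"), which is what
    the protocol's own example (4 mL per 0.5 g) implies.
    """
    if tissue_g <= 0:
        raise ValueError("tissue_g must be positive")
    return tissue_g * volumes_per_g


def spin_minutes(rcf_g, target_g_min=6e6):
    """Minutes of centrifugation at rcf_g (in units of g) needed to
    accumulate the target integrated force in g*min."""
    if rcf_g <= 0:
        raise ValueError("rcf_g must be positive")
    return target_g_min / rcf_g


# The protocol's own example values
print(solubilizer_volume_ml(0.5))   # mL of solubilizer for 0.5 g of liver
print(spin_minutes(500_000))        # minutes at 500 000 x g
```

The same helper shows how much longer a slower rotor would need to run to deliver the same integrated force; whether pelleting is truly equivalent at lower force is not something the text establishes.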

2.2 Two-dimensional electrophoresis 

Sample proteins are resolved by 2-D electrophoresis using
the 20 × 25 cm Iso-Dalt® 2-D gel system ([26-29]; produced
by LSB and by Hoefer Scientific Instruments, San
Francisco) operating with 20 gels per batch. All first-dimensional
isoelectric focusing (IEF) gels are prepared using the
same single standardized batch of carrier ampholytes
(BDH 4-8A in the present case, selected by LSB's batch-testing
program for rat and mouse database work**). A 10 µL
sample of solubilized liver protein is applied to each gel,
and the gels are run for 33 000 to 34 500 volt-hours using a
progressively increasing voltage protocol implemented by
a programmable high-voltage power supply. An Angelique™
computer-controlled gradient-casting system (produced
by LSB) is used to prepare second-dimensional sodium
dodecyl sulfate (SDS) polyacrylamide gradient slab
gels in which the top 5% of the gel is 11 %T acrylamide, and
the lower 95% of the gel varies linearly from 11% to 18 %T.
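The slab gel composition described above can be written as a simple piecewise function of depth. The sketch below is only an illustration of that profile (an 11 %T plateau over the top 5%, then a linear rise to 18 %T at the bottom); the function name and the use of a 0-to-1 fractional depth are assumptions made for the example, not part of the casting system.

```python
def percent_t(depth_fraction, top_t=11.0, bottom_t=18.0, plateau=0.05):
    """Acrylamide concentration (%T) at a fractional depth in the slab.

    depth_fraction: 0.0 at the top edge of the gel, 1.0 at the bottom.
    The top `plateau` fraction is held at top_t; below it, %T rises
    linearly to bottom_t.
    """
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must lie between 0 and 1")
    if depth_fraction <= plateau:
        return top_t
    linear_position = (depth_fraction - plateau) / (1.0 - plateau)
    return top_t + (bottom_t - top_t) * linear_position


# %T at a few depths down the gel
for depth in (0.0, 0.05, 0.525, 1.0):
    print(depth, round(percent_t(depth), 2))
```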

This system has recently been modified so as to employ a
commercially available 30.8 %T acrylamide/N,N'-methylenebisacrylamide
prepared solution (thus avoiding the handling
of the solid acrylamide monomer) and three additional
stock solutions: buffer (made from Sigma pre-set
Tris), persulfate and N,N,N',N'-tetramethylethylenediamine
(TEMED). Each gel is identified by a computer-printed
filter paper label polymerized into the lower left corner
of the gel. First-dimensional IEF tube gels are loaded



** This material (succeeding certified batches of which are available from 
Hoefer Scientific Instruments) has the most linear pH gradient pro- 
duced by any ampholyte tested except for the Pharmacia wide range 
(which has an unacceptable tendency to bind high-molecular weight 
acidic proteins, causing them to streak). 



directly (as extruded) onto the slab gels without equilibration,
and held in place by polyester fabric wedges (Wedgies®,
produced by LSB) to avoid the use of hot agarose.
Second-dimensional slab gels are run overnight, in groups
of 20, in cooled DALT tanks (10 °C) with buffer circulation.
All run parameters, reagent source and lot information,
and notations of deviation from expected results are entered
by the technician responsible on a detailed, multi-page
record of the experiment.

2.3 Staining 

Following SDS-electrophoresis, slab gels are stained for
protein using a colloidal Coomassie Blue G-250 procedure
in covered plastic boxes, with 10 gels (totalling approximately
1 L of gel) per box. This procedure (based on the work
of Neuhoff [30, 33]) involves fixation in 1.5 L of 50% ethanol
and 2% phosphoric acid for 2 h, three 30 min washes,
each in 2 L of cold tap water, and transfer to 1.5 L of 34%
methanol, 17% ammonium sulfate and 2% phosphoric acid
for 1 h, followed by the addition of a gram of powdered Coomassie
Blue G-250 stain. Staining requires approximately 4
days to reach equilibrium intensity, whereupon gels are
transferred to cool tap water and their surfaces rinsed to remove
any particulate stain prior to scanning. Gels may be
kept for several months in water with added sodium azide.
The water washes remove ethanol that would dissolve the
stain (and render the system noncolloidal, with high backgrounds).
The concentrated ammonium sulfate and methanol
solution is diluted by equilibration with the water volume
of the gels to automatically achieve the correct final
concentrations for colloidal staining. Practical advantages
of this staining approach can be summarized as follows: (i)
the low, flat background makes computer evaluation of
small spots (max OD < 0.02) possible, especially when
using laser densitometry; (ii) up to 1500 spots can be reliably
detected on many gels (e.g., rat liver) at loadings low
enough to preserve excellent resolution; and (iii) reproducibility
appears to be very good: at least several hundred
spots have coefficients of reproducibility less than 15%.
This value is at least as good as previous CBB methods, and
significantly better than many silver stain systems.
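The dilution step mentioned in this section can be checked with simple arithmetic. The sketch below assumes that the roughly 1 L of gel per box behaves as 1 L of water after the washes and that equilibration is complete; both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
def equilibrated_percent(stock_percent, solution_l=1.5, gel_water_l=1.0):
    """Concentration after a staining solution equilibrates with the
    water held in the gels (simple dilution, complete mixing assumed)."""
    if solution_l <= 0 or gel_water_l < 0:
        raise ValueError("volumes must be positive")
    return stock_percent * solution_l / (solution_l + gel_water_l)


# Stock concentrations as given in the staining protocol
stock = {"methanol": 34.0, "ammonium sulfate": 17.0, "phosphoric acid": 2.0}
for name, percent in stock.items():
    print(name, round(equilibrated_percent(percent), 1))
```

Under these assumptions the bath settles at about 20% methanol, 10% ammonium sulfate and 1.2% phosphoric acid, which is the sense in which the gels' own water volume brings the concentrated solution to its working strength.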

2.4 Positional standardization 

The carbamylated rabbit muscle creatine phosphokinase 
(CPK) standards [32] are purchased from Pharmacia and 
BDH. Amino acid compositions, and numbers of residues 
present in proteins used for internal standardization, are 
taken from the Protein Identification Resource (PIR) se- 
quence database [33]. 

2.5 Computer analysis 

Stained slab gels are digitized in red light at 134 micron re- 
solution, using either a Molecular Dynamics laser scanner 
(with pixel sampling) or an Eikonix 78/99 CCD scanner. 
Raw digitized gel images are archived on high-density DAT 
tape (or equivalent storage media) and a greyscale video- 
print prepared from the raw digital image as hard-copy 
backup of the gel image. Gels are processed using the Kepler®
software system (produced by LSB), a commercially
available workstation-based software package built on
some of the principles of the earlier TYCHO system [34-41].
Procedure PROC008 is used to yield a spotlist giving
position, shape and density information for each detected
spot. This procedure makes use of digital filtering, mathematical
morphology techniques and digital masking to remove
the background, and uses full 2-D least-squares optimization
to refine the parameters of a 2-D Gaussian shape
for each spot. Processing parameters and file locations are
stored in a relational database, while various log files detailing
operation of the automatic analysis software are archived
with the reduced data. The computed resolution and
level of Gaussian convergence of each gel are inspected
and archived for quality control purposes.

Experiment packages are constructed using the Kepler® experiment
definition database to assemble groups of 2-D
patterns corresponding to the experimental groups (e.g.,
treated and control animals). Each 2-D pattern is matched
to the appropriate "master" 2-D pattern (pattern
F344MST3 in the case of Fischer 344 rat liver), thereby
providing linkage to the existing rodent protein 2-D databases.
The software allows experiments containing hundreds
of gels to be constructed and analyzed as a unit, with
up to 100 gels displayed on the screen at one time for comparative
purposes and multiple pages to accommodate experiments
of > 1000 gels. For each treatment, proteins
showing significant quantitative differences vs. appropriate
controls are selected using group-wise statistical parameters
(e.g., Student's t-test; Kepler® procedure STUDENT).
Proteins satisfying various quantitative criteria (such as P <
0.001 difference from appropriate controls) are represented
as highlighted spots onscreen or on computer-plotted
protein maps and stored as spot populations (i.e., logical
vectors) in a liver protein database. Quantitative data
(spot parameters, statistical or other computed values) are
stored as real-valued vectors in the database. Analysis of coregulation
is performed using a Pearson product-moment
correlation (Kepler procedure CORREL) to determine
whether groups of proteins are coordinately regulated by
any of the treatments. Such groups can be presented graphically
on a protein map, and reported together with the statistical
criteria used to assess the level of coregulation. Multivariate
statistical analysis (e.g., principal components analysis)
is performed on data exported to SAS (SAS Institute).
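The two statistics named in this section are standard and can be sketched in a few lines. The code below is not the Kepler STUDENT or CORREL procedure, whose internals are not described here; it is a plain-Python illustration of a pooled-variance Student's t statistic and a Pearson product-moment correlation, with hypothetical function names and made-up abundance values. Converting t to a P value requires the t distribution and is left out.

```python
import math


def student_t(group_a, group_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two values")
    mean_a = sum(group_a) / na
    mean_b = sum(group_b) / nb
    ss_a = sum((v - mean_a) ** 2 for v in group_a)
    ss_b = sum((v - mean_b) ** 2 for v in group_b)
    df = na + nb - 2
    pooled_var = (ss_a + ss_b) / df
    if pooled_var == 0:
        raise ValueError("no within-group variance")
    std_err = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (mean_a - mean_b) / std_err, df


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical spot abundances: five control and five treated animals
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treated = [2.0, 3.0, 4.0, 5.0, 6.0]
print(student_t(control, treated))
# Two hypothetical spots measured across the same three gels
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```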

2.6 Graphical data output 

Graphical results are prepared in GKS and translated 
within Kepler® into output for any of a variety of devices.
Line-drawing output is typically prepared as PostScript and
printed on an Apple LaserWriter. Detailed maps presented
here have been generated using an ultra-high-resolution
PostScript-compatible Linotronic output device. Greyscale
graphics are reproduced from the workstation screen using 
a Seikosha videoprinter. Patterns are shown in the standard 
orientation, with high molecular mass at the top and acidic 
proteins to the left.

2.7 Experiment LSBC04

In the study described here, 12-week-old Charles River
male F344 rats were used. Diets were prepared at LSB,
based on a Purina 5755M Basal Purified Diet. Lovastatin
and cholestyramine were obtained as prescription pharmaceuticals,
ground and mixed with the diet at concentrations
of 0.075% and 1%, respectively. The high cholesterol diet
was Purina 5801M-A (5% cholesterol plus 1% sodium cholate
in the control diet). Animal work was carried out by Microbiological
Associates (Bethesda, MD). Animals were acclimatized
for one week on the control diet, fed test or control
diets for one week, and sacrificed on day 8. Average
daily doses of lovastatin and cholestyramine in appropriate
groups were 37 mg/kg/day and 5 g/kg/day, respectively,
based on the weight of the food consumed. Liver samples
were collected and prepared for 2-D electrophoresis according
to the standard liver protocol (homogenization in 8
volumes of 9 M urea, 2% NP-40, 0.5% dithiothreitol, 2%
LKB pH 9-11 carrier ampholytes, followed by centrifugation
for 30 min at 80 000 × g). Kidney, brain and plasma
samples were frozen. Gels were run as described above,
and the data were analyzed using the Kepler® system. Gels
were scaled, to remove the effect of differences in protein
loading, by setting the summed abundances of a large number
of matched spots equal for each gel (linear scaling).
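The linear scaling used for this experiment (making the summed abundance of a set of matched spots equal across gels) can be illustrated with a short sketch. The data layout, the function name and the choice of the mean total as the common target are all assumptions made for the example; the text does not say which spots or which target value the Kepler system used.

```python
def linear_scale(gels, target_total=None):
    """Scale each gel so the summed abundance of spots matched in every
    gel is the same.

    gels: dict mapping gel id -> dict mapping spot number -> abundance.
    target_total: common summed abundance; defaults to the mean of the
    per-gel totals (an arbitrary but convenient choice).
    """
    if not gels:
        raise ValueError("no gels supplied")
    # Only spots present in every gel contribute to the scaling sum
    common = set.intersection(*(set(spots) for spots in gels.values()))
    if not common:
        raise ValueError("no spots are matched across all gels")
    totals = {gid: sum(spots[s] for s in common) for gid, spots in gels.items()}
    if any(total <= 0 for total in totals.values()):
        raise ValueError("summed abundance must be positive for every gel")
    if target_total is None:
        target_total = sum(totals.values()) / len(totals)
    scaled = {}
    for gid, spots in gels.items():
        factor = target_total / totals[gid]
        scaled[gid] = {s: value * factor for s, value in spots.items()}
    return scaled


# Hypothetical example: gel "b" carries twice the protein load of gel "a"
gels = {
    "a": {413: 10.0, 235: 30.0},
    "b": {413: 20.0, 235: 60.0, 367: 5.0},
}
print(linear_scale(gels))
```

After scaling, both gels give the same value for each matched spot in this example, so the twofold loading difference no longer appears as an apparent regulation effect.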



3 Results and discussion 

3.1 The rat liver protein 2-D map 

F344MST3 is a standard 2-D pattern of rat liver proteins,
based on the Fischer 344 strain. This pattern was initiated
from a single 2-D gel and extensively edited in an experiment
comparing it to a range of protein loads, so as to include
both small spots and well-resolved representations of
high-abundance spots. More than 700 rat liver 2-D patterns
have been matched to F344MST3 in a series of drug effects
and protein characterization experiments, and numerous
new spots (induced by specific drugs, for instance) have
been added as a result. A modified version including additional
spots present in the Sprague-Dawley outbred rat has
also been developed (data not shown). Figure 1 shows a
greyscale representation and Fig. 2 a schematic plot of the
master pattern. More than 1200 spots are included, most of
which are visible on typical gels loaded with 10 µL of solubilized
liver protein prepared by the standard method and
stained with colloidal Coomassie Blue. Master spot numbers
(MSN's) have been assigned to all proteins, and appear
in the following figures, each showing one quadrant of
the pattern. Figure 3 shows the upper left (acidic, high
molecular mass) quadrant, Fig. 4 the upper right (basic,
high molecular mass) quadrant, Fig. 5 the lower left (acidic,
low molecular mass) quadrant, and Fig. 6 the lower right
(basic, low molecular mass) quadrant. The quadrants overlap
as an aid to moving between them. The gel position (in
100 micron units), isoelectric point (relative to the CPK internal
pI standards) and SDS molecular mass (from the calibration
curve in Fig. 8) are listed for each spot (Table 1). Because
of the precision of the CPK-pI values, these parameters
can be used to relate spot locations between gel systems
more reliably than using pI measurements expressed
as pH. A major objective of current studies is the identification
of all major spots corresponding to known liver proteins,
as well as rigorous definitions of subcellular organelle
contents. Of particular interest to us is the parallel development
of identifications in the rat and mouse liver
maps, allowing detailed comparisons of gene expression effects
in the two systems. The results of these studies will be
presented systematically in a later edition of this database,
but we include here a useful series of 22 orienting identifications
as an aid to other users of the rat liver pattern (Table 2).



3.2 Carbamylated charge standards, computed pIs and
molecular mass standardization

We have previously shown that the use of a system of closely-spaced
internal pI markers (made by carbamylating a
basic protein) offers an accurate and workable solution to
the problem of assigning positions in the pI dimension [32].
The same system, based on 36 protein species made by carbamylating
rabbit muscle CPK, has been used here to assign
pIs to most rat liver acidic and neutral proteins. The
standards were coelectrophoresed with total liver proteins,
and the standard spots added to a special version of the
master pattern F344MST3. The gel x-coordinates of all
liver protein spots lying within the CPK charge train were
then transformed into CPK pI positions by interpolation
between the positions of immediately adjacent standards
(Table 1) using a Kepler® vector procedure.

It has proven possible to compute fairly accurate pI values
for many proteins from the amino acid composition [42].
We have attempted here to test a further elaboration of this
approach, in which we computed pIs for the CPK standards
themselves, based on our knowledge of the rabbit muscle
CPK sequence and the fact that adjacent members of the
charge train typically differ by blockage of one additional
lysine residue (Table 3). We compared these values to similar
computed pIs for an additional set of carbamylated standards
made from human hemoglobin beta chains and a series
of rat liver and human plasma proteins of known position
and sequence (Fig. 7, Table 4). The result demonstrates
good concordance between these systems. Two proteins
show significant deviations: liver fatty-acid binding protein
(FABP; #1 in Table 4) and protein disulphide isomerase
(#20 in the table). The FABP spot present on F344MST3
may represent a charge-modified version of a more basic
parent spot closer to the expected pI, not resolved in the
IEF/SDS gel. Of particular importance is the fact that, by
comparing computed pIs of sequenced but unlocated proteins
with the CPK pIs, we can assign a probable gel location
without making any assumptions regarding the actual
gel pH gradient. This offers a useful shortcut, given the vagaries
of pH measurement on small diameter IEF gels. We
have used this approach to compute the CPK pIs of all rat
and mouse proteins in the PIR sequence database, as an aid
to protein identification (data not shown).

In order to standardize SDS molecular weight (SDS-MW),
we have used a standard curve fitted to a series of identified
proteins (Fig. 8). Rather than using molecular mass per se,
we have elected to use the number of amino acids in the
polypeptide chain, as perhaps a better indication of the
length of the SDS-coated rod that is sieved by the second
dimension slab. The resulting values were multiplied by
112 (the weighted average mass of amino acids in sequenced
proteins) to give predicted molecular masses. Because
we use gradient slabs, we have not constrained the fitted
curve to conform to any predetermined model; rather
we tried many equations and selected the best using the
program "Tablecurve" on a PC. The equation chosen was
y = a + bx + c/x^2, where y is the number of residues, x is the gel
y-coordinate, a is 511.83, b is -0.2731 and c is approximately
3.318 × 10^7. The resulting fit appears to be fairly good over a
broad range of molecular mass.
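Both standardizations in this section are simple enough to sketch in code: linear interpolation of a spot's gel x-coordinate between adjacent CPK charge standards, and the fitted curve that converts a gel y-coordinate to residue count and then to mass. The code below is an illustration only. The standard positions in the example are hypothetical, the function names are invented, and the value used for c (3.318e7) is an assumption: that constant is only partly legible in the source, and this approximation was chosen because it reproduces SDS-MW entries in Table 1.

```python
from bisect import bisect_right

# Fitted constants for residues = A + B*y + C/y**2
A = 511.83
B = -0.2731
C = 3.318e7  # assumption: approximate value, not fully legible in the source
MEAN_RESIDUE_MASS = 112.0  # weighted average residue mass given in the text


def cpk_pi(gel_x, standards):
    """Interpolate a CPK pI position for a spot at gel_x.

    standards: list of (gel_x, cpk_pi) pairs sorted by increasing gel_x,
    one per carbamylated standard spot. Returns None for spots lying
    outside the charge train.
    """
    xs = [x for x, _ in standards]
    if not xs or gel_x < xs[0] or gel_x > xs[-1]:
        return None
    i = bisect_right(xs, gel_x)
    if i == len(xs):                 # exactly on the last standard
        return standards[-1][1]
    (x0, p0), (x1, p1) = standards[i - 1], standards[i]
    return p0 + (p1 - p0) * (gel_x - x0) / (x1 - x0)


def sds_mw(gel_y):
    """Predicted molecular mass from the gel y-coordinate."""
    if gel_y <= 0:
        raise ValueError("gel_y must be positive")
    residues = A + B * gel_y + C / gel_y ** 2
    return residues * MEAN_RESIDUE_MASS


# Hypothetical standards: three adjacent members of a charge train
standards = [(100.0, -2.0), (200.0, -1.0), (300.0, 0.0)]
print(cpk_pi(250.0, standards))
# y-coordinate 434 is listed in Table 1 with an SDS-MW of 63,800
print(round(sds_mw(434), -2))
```

Because the table lists coordinates in rounded 100 micron units, small differences between the computed and tabulated masses are expected, especially at low y where the curve is steep.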



3.3 An example of rat liver gene regulation: Cholesterol 
metabolism 

Experiment LSBC04 was designed as a small-scale test of 
the regulation of cholesterol metabolism in vivo by three 
agents included in the diet: lovastatin (Mevacor®, an inhibitor
of HMG-CoA reductase); cholestyramine (a bile acid
sequestrant that has the effect of removing cholesterol 
from the gut-liver recirculation); and cholesterol itself. The 
first two agents should lower available cholesterol and the 
third should raise it, allowing manipulation of relevant 
gene expression control systems in both directions. Such 
an experiment offers an interesting test of the 2-D mapping 
system since most of the pathway enzymes are present in 
low abundance, many are membrane-bound and difficult 
to solubilize, and the pathway itself is complex. Approxima- 
tely 1000 proteins were separated and detected in liver ho- 
mogenates. Twenty-one proteins were found to be affected 
by at least one treatment, and these could be divided into 
several coregulated groups. 

3.3.1 MSN 413 (putative cytosolic HMG-CoA synthase) 
and sets of spots regulated coordinately or inversely 

One group of spots (including a spot assigned to the cytosolic
HMG-CoA synthase, MSN 413) showed the expected
increase in abundance with lovastatin or cholestyramine,
the synergistic further increase with lovastatin and cholestyramine,
and a dramatic decrease with the high cholesterol
diet. Spot number 413 is the most strongly regulated protein
in the present experiment, showing a 5- to 10-fold induction
after a 1 week treatment with 0.075% lovastatin and
1% cholestyramine in the diet (Figs. 9 and 10). Its expression
follows precisely the expectation for an enzyme whose
abundance is controlled by the cholesterol level; it is progressively
increased from the control levels by cholestyramine,
lovastatin and lovastatin plus cholestyramine, and it
sinks below the threshold of detection in animals fed the
high cholesterol diet. This spot has been tentatively identified
as the cytosolic HMG-CoA synthase, based on a reaction
with an antiserum to that protein provided by Dr. Michael
Greenspan at Merck Sharp & Dohme Research Laboratories.
This enzyme lies immediately before HMG-CoA
reductase in the liver cholesterol biosynthesis pathway, and
is known to be co-regulated with it. Spot 413 has an SDS
molecular weight of about 54 000 and a CPK pI of -11.4, in
reasonably close agreement with a molecular weight of
57 300 and a CPK pI of -15.7 computed from the known sequence
of the hamster enzyme [43].

Using a classical product-moment correlation test (Kepler
procedure CORREL), a series of five additional spots was
found to be coregulated with 413. The level of correlation
was exceedingly high (> 95%). Two of these, 1250 and 933,
are at similar molecular weights and approximately one
charge more acidic than 413 (Fig. 9), indicating that they
may be covalently modified forms of the 413 polypeptide.
This suspicion is strengthened by the observation that both
spots are also stained by the antibody to cytosolic HMG-CoA
synthase. The remaining three correlated spots appear
to comprise an additional related pair (1253 and 1001) of
around 40 kDa and a single spot (1119) of around 28 kDa.
Because these two presumed proteins are present at sub[...]
cytosolic HMG-CoA synthase is reported to consist of only one
type of polypeptide, they are likely to represent other, very
tightly coregulated enzymes. A second group of six spots
was selected based on [...] inverse of that for spot 413
(MSN's [...]4, 79, 1[...]8, 182, 204, [...]; data not shown).
For these proteins, the lowest level of expression occurs
with exposure to lovastatin plus cholestyramine, and the
highest level upon exposure to the high-cholesterol diet.
Spots 182 and 79 are highly correlated and lie [...] one
charge [...] at the same molecular weight; they [...] of a
single protein. The other four spots probably represent
additional enzymes or subunits.

3.3.2 MSN 235 and coregulated spots 

A third group of five spots, mainly comprised of mitochondrial
proteins including putative mitochondrial HMG-CoA
synthase spots, showed a modest induction by lovastatin
alone but little or no effect with any of the other treatments
(including the combination of lovastatin and cholestyramine;
Fig. 12). This result is intriguing because lovastatin
was expected to affect only the regulation of enzymes of
cholesterol synthesis, which is entirely extra-mitochondrial.
Three of the spots (235, 134, 144) form a closely-packed
triad at approximately 30 kDa, and are likely to represent
isoforms of one protein. All three spots are stained
by an antibody to the mitochondrial form of HMG-CoA
synthase obtained from Dr. Greenspan. Subcellular fractionation
indicates a mitochondrial location. The other two
spots (633 at about 38 kDa and 724 at about 6 kDa) are
each present at lower abundance than the members of the
triad.

3.3.3 An example of an anti-synergistic effect

A sixth spot (367) shows strong induction by lovastatin
(two- to threefold), and about half as much induction with
lovastatin plus cholestyramine, but without sharing the
animal-animal heterogeneity pattern of the 235-set (Fig. 13).
This protein is also mitochondrial, and represents the clearest
example of an anti-synergistic effect of lovastatin and
cholestyramine. The existence of such an effect demonstrates
that lovastatin and cholestyramine do not act exclusively
through the same regulatory pathway.

3.3.4 Complexity of the cholesterol synthesis pathway

Taken together, these results suggest that treatment with
lovastatin alone can affect both cytosolic and mitochondrial
pathways using HMG-CoA, while cholestyramine, on the
other hand, either alone or in combination with lovastatin,
produces a strong effect on the putative cytosolic pathway,
but little or no effect on the putative mitochondrial pathway.
An explanation for this difference may lie in lovastatin's
effect on levels of HMG-CoA and related precursor
compounds that are exchanged between the cytosol and
the mitochondrion, whereas cholestyramine should affect
only the cytosolic pathway, directly controlled by cholesterol
and bile acid levels. It remains to be explained why some
proteins of the putative mitochondrial pathway are so
much more variable in their expression in all groups. An
examination of all the coregulated groups suggests that
quantitative statistical techniques can extract a wealth of
interesting information from large sets of reproducible gels. The
abundance of spots in the 413 coregulation group, for example,
shows an amazing level of concordance in their relative
expression among the five individuals of the lovastatin and
cholestyramine treatment group. This effect is not due to
differences in total protein loading, since they have already
been removed by scaling, and since proteins with quite different
regulation patterns can be demonstrated (e.g., Fig.
13). Such effects raise the possibility that many gene coregulation
sets may be revealed through the study of a sufficiently
large population of control animals (i.e., without
any experimental manipulation). This approach, exploiting
natural biological variation in protein expression instead of
drug effects, offers an important incentive for the construction
of a large library of control animal patterns.



4 Conclusions

Because of the widespread use of rat liver in both basic
biochemistry and in toxicology, there is a long-term need for a
comprehensive database of liver proteins. The rat liver master
pattern presented here has proven to be an accurate
representation of this system, having been matched to more
than 700 gels to date. As the number of proteins identified
and the number of compounds tested for gene expression
effects grows, we expect this database to contribute valuable
insights into gene regulation. Its practical utility in several
areas of mechanistic toxicology is already being demonstrated.

Received September 11, 1991



6 Addendum 1: Figures 1-13

Figure 1. Synthetic representation of the standard rat liver 2-D master pattern, rendered as a greyscale image using a videoprinter.

Figure 7. (a) Plot of computed isoelectric point versus gel x-position for
two sets of carbamylated standard proteins (rabbit muscle CPK [+] and
human hemoglobin β chain, filled diamonds) and several other proteins
(shaded squares). (b) The identities of the various proteins represented
by the squares are indicated by the numbers in corresponding positions
on (a); these refer to Table 4.

Figure 9. Montage showing effects in the
region of MSN:413. The montage shows a
small window into one portion of the 2-D
pattern, one row of windows for each experimental
group, and one panel for each gel
in the experiment. The left-most pattern
in each row is a group-specific copy of the
master pattern followed by the patterns
for the five individual rats in the group.
The highlighted protein spots (filled circles)
are spot 413 (on the right of each panel;
identified as cytosolic HMG-CoA synthase)
and two modified forms of it (1250
and 933). From the top, the rows (experimental
groups) are: high cholesterol, controls,
cholestyramine, lovastatin, and lovastatin
plus cholestyramine.

Figure 10. Bargraph showing the quantitative
effects of various treatments on the
abundance of MSN:413 (cytosolic HMG-CoA
synthase) in the gels of Fig. 9.




Figure 11. Bargraphs of a series of six coregulated
spots including MSN:413. In the
bargraphs, the abundances of the appropriate
spot (master spot number shown at
the top of the panel) in each animal are
shown. The five five-animal groups are in
the order (left to right): high cholesterol,
controls, cholestyramine, lovastatin, and
lovastatin plus cholestyramine. Each bar
within a group represents one experimental
animal liver (one 2-D gel). Note the correlated
expression of the 6 spots, especially
in the two far right (most strongly induced)
groups.



Electrophoresis 1991, 12, 907-930

L. Anderson et al.




Figure 12. Data on a second coregulated
group of spots, presented as in Fig. 11. The
fourth experimental group (lovastatin)
shows a modest induction, while the fifth
group (lovastatin plus cholestyramine)
does not.




Figure 13. Data on spot MSN:367, presented as in Fig. 11. This protein
shows unambiguously the anti-synergistic effect of lovastatin and
cholestyramine (fifth group) as compared to lovastatin (fourth group).
This response contrasts strongly with the regulation pattern seen in Fig. 11.



7 Addendum 2: Tables 1-4 



Table 1. Master table of proteins in the rat liver database a)

MSN          X            Y      CPKpI    SDSMW
3            311          434    <-35.0   63,800
5            568          263    -24.3    102,900
[illegible]  812          426    -16.0    64,800
11           549          268    -25.2    101,000
15           845          520    -15.3    55,200
17           629          589    -21.6    50,000
18           906          414    -14.0    66,300
19           755          298    -17.5    90,200
20           649          403    -20.9    67,900
21           1204         448    -8.7     62,100
22           332          434    <-35.0   63,800
23           787          424    -16.6    65,000
24           313          417    <-35.0   66,000
25           807          516    -16.1    55,500
97           1184         524    -9.0     54,900
[illegible]  1263         446    -8.0     62,400
[illegible]  743          605    -17.8    49,000
[illegible]  768          112    -17.2    348,600
[illegible]  1216         417    -8.6     66,000
[illegible]  1145         445    -9.5     62,500
[illegible]  1037         555    -11.3    52,400
[illegible]  863          412    -14.9    66,600
[illegible]  719          606    -18.7    48,900
[illegible]  [illegible]  694    -17.3    43,800
[illegible]  [illegible]  470    <-35.0   59,800
41           1165         569    -9.2     51,400
42           684          607    -19.6    48,800
43           1318         589    -7.3     50,000
44           1924         362    -0.1     74,600
40           [illegible]  586    -8.7     50,200
[illegible]  [illegible]  447    -6.3     62,300
[illegible]  [illegible]  454    <-35.0   61,500
[illegible]  [illegible]  587    -22.5    50,100
[illegible]  [illegible]  535    -21.8    53,900
[illegible]  1110         522    -10.0    55,000
[illegible]  [illegible]  499    -0.9     57,000
[illegible]  795          177    -18.3    170,800
04           [illegible]  500    >0.0     56,900
[illegible]  799          830    -18.4    37,300
56           676          533    -19.8    54,100
57           1682         302    -2.5     89,000
58           1091         580    -10.3    50,600
59           1171         585    -9.2     50,300
60           1400         624    -6.2     47,800
61           1653         508    -0.6     56,200
62           1888         567    -0.4     51,500
65           735          297    -18.1    90,500
66           1263         312    -8.0     85,900
67           1252         407    -8.1     67,300
68           779          692    -16.8    43,900
69           1064         296    -10.8    90,800
71           656          589    -20.6    50,000
72           638          545    -21.2    53,100
73           1562         583    -3.6     50,400
74           1570         556    -3.8     52,300
75           1264         621    -8.0     48,000
76           1338         564    -7.0     51,800
77           1833         363    -0.8     74,400
78           1767         565    -1.5     51,700
79           925          738    -13.6    41,600
80           534          698    -26.1    43,600
81           1811         363    -1.0     74,500
82           1412         681    -6.0     44,500
83           1471         347    -5.0     77,500
84           1662         563    -2.7     51,800
85           1596         479    -3.4     58,900
86           1817         301    -0.9     89,100
87           516          1371   -27.0    17,400
88           1589         698    -3.5     43,600
89           1706         719    -2.2     42,500
90           651          329    -20.8    61,700
91           1415         710    -6.0     43,000
92           1773         545    -1.4     53,200
93           1338         446    -7.0     62,300
94           1708         696    -2.2     43,700

MSN    X      Y      CPKpI        SDSMW
95     1119   536    -9.9         53,800
96     1731   756    -2.0         40,700
97     1033   566    -11.4        51,600
98     1406   565    -6.1         51,700
99     578    1149   -23.8        25,000
100    2004   538    >0.0         53,700
101    1106   623    -10.1        47,900
102    482    455    -28.5        61,300
103    665    830    -20.2        37,300
104    773    1182   -17.0        23,800
105    312    1117   <-35.0       26,100
106    1769   509    -1.5         56,100
107    1585   720    -3.6         42,500
108    1692   607    -2.4         38,300
109    1482   593    -4.8         49,700
110    778    516    -16.9        55,500
111    1728   700    -2.0         43,500
113    1191   680    -6.9         44,500
114    1298   185    -7.5         160,800
115    682    907    -19.6        34,100
116    1146   610    -9.5         48,700
117    1548   849    -4.1         36,500
118    1050   577    -11.1        50,800
120    1530   628    -4.3         37,400
121    638    423    -15.4        65,200
122    1572   712    -3.8         42,900
123    23     1433   <-35.0       15,300
124    621    1474   -21.9        13,900
125    1298   662    -7.5         36,000
126    672    921    -14.7        33,500
127    1000   717    -12.0        42,600
128    1229   311    -8.4         86,100
129    1422   832    -5.8         37,300
130    1776   499    -1.4         57,000
131    1930   757    [illegible]  40,700
132    660    537    -20.4        53,800
133    666    1019   -20.2        29,700
134    1271   862    -7.9         36,000
135    1161   1389   -9.3         16,800
136    453    1063   -29.7        28,100
137    1858   823    -0.6         37,700
138    1504   697    -4.6         43,700
139    1488   707    -4.8         43,200
140    1689   756    -2.4         40,700
141    311    1417   <-35.0       15,800
142    1366   915    -6.7         33,800
143    1429   346    -5.7         77,900
144    615    1017   -22.1        29,800
145    2006   566    >0.0         51,600
146    2006   518    >0.0         55,300
147    1070   1108   -10.7        26,500
148    1347   578    -6.9         50,800
149    541    1481   -25.7        13,700
150    1645   760    -2.8         40,500
151    1269   236    -7.9         117,000
152    1507   911    -4.5         33,900
153    1722   448    -2.1         62,100
154    932    503    -13.5        56,600
155    1031   294    -11.4        91,400
156    1970   684    >0.0         44,400
157    1258   183    -8.1         162,400
158    1275   417    -7.8         65,900
159    1663   620    -2.6         37,800
160    1034   527    -11.4        54,600
161    1953   771    >0.0         40,000
162    1020   1482   -11.6        13,700
164    1566   806    -3.8         38,400
166    1905   565    -0.2         51,700
167    1340   181    -7.0         164,900
168    1506   583    -4.6         50,400
169    1338   678    -7.0         44,700
170    1969   541    >0.0         53,500
171    800    378    -16.3        71,800
172    476    958    -28.7        32,100
173    919    1314   -13.7        19,300
MSN    X      Y      CPKpI    SDSMW
174    1364   183    -6.7     162,900
175    625    393    -15.7    69,300
177    1582   553    -3.6     52,600
178    1321   710    -7.2     43,000
179    1089   615    -10.4    48,300
180    1866   567    -0.5     51,600
181    411    295    -32.1    91,200
182    804    730    -16.2    42,000
184    1860   896    -0.6     34,500
185    1997   1017   >0.0     29,800
186    279    1113   <-35.0   26,300
187    773    296    -17.0    90,800
188    1538   807    -4.2     38,400
191    1560   674    -3.9     44,900
192    1818   687    -0.9     44,200
193    1469   555    -5.0     52,400
194    1380   266    -6.4     101,600
195    784    632    -16.7    47,300
196    1227   1185   -8.4     23,700
197    667    553    -20.1    52,600
198    2006   681    >0.0     44,500
199    1711   674    -2.2     44,900
200    872    424    -14.7    65,000
201    292    435    <-35.0   63,700
202    736    253    -18.0    107,800
203    786    829    -16.7    37,400
204    1224   589    -8.5     50,000
205    439    983    -30.9    31,100
206    1994   571    >0.0     51,300
207    1895   687    -0.3     44,200
208    240    1418   <-35.0   15,800
210    1700   499    -2.3     57,000
211    902    517    -14.1    55,400
213    1087   684    -10.4    44,400
214    1340   668    -7.0     [illegible]
215    1591   495    -3.5     [illegible]
216    1585   755    -3.6     40,700
217    1159   393    -9.3     69,300
218    931    572    -13.5    51,200
219    713    177    -18.7    170,500
220    1479   911    -4.9     33,900
221    965    927    -12.8    33,300
223    934    716    -13.5    42,700
225    1812   1045   -1.0     28,800
226    821    411    -15.8    66,800
227    1566   1483   -3.6     13,600
228    1065   567    -10.8    51,600
229    1577   890    -3.7     34,800
230    1456   496    -5.2     57,300
232    1440   849    -5.5     36,500
234    1692   489    -2.4     57,900
235    618    1004   -22.0    30,300
236    920    1138   -13.7    25,400
237    952    1008   -13.1    30,200
238    1611   541    -3.2     53,500
239    1489   720    -4.8     42,500
240    501    448    -27.7    62,100
241    1820   569    -0.9     51,400
242    1357   658    -6.8     45,800
243    711    1182   -18.7    23,800
244    1855   621    -0.6     48,000
245    1189   474    -8.9     59,300
246    551    459    -25.1    61,000
247    1348   604    -6.9     49,100
248    460    448    -29.3    62,100
249    1733   451    -1.9     61,800
250    1974   788    >0.0     39,200
251    808    392    -16.1    69,500
252    874    553    -14.6    52,500
253    753    848    -17.6    36,500
254    995    450    -12.1    61,900
255    1690   679    -2.4     44,600
256    994    1006   -12.1    30,200
257    508    464    -27.4    60,400
258    1517   820    -4.4     37,800

a) Master table of proteins in the rat liver database, showing spot master number, gel position (x and y), isoelectric point relative to CPK standards, and
predicted molecular mass (from the standard curve of Fig. 8).
MSN    X      Y      CPKpI    SDSMW
259    1796   961    -1.1     31,900
260    661    1361   -20.4    17,700
261    1725   679    -2.0     44,600
262    496    1127   -28.0    25,800
263    1063   172    -10.9    177,400
265    1390   673    -6.3     45,000
266    510    437    -27.3    63,400
267    660    1038   -20.4    29,000
268    430    961    -31.0    31,900
269    1044   606    -11.2    48,900
270    2019   853    >0.0     36,300
271    857    422    -15.0    65,200
272    895    968    -14.2    31,700
274    1292   712    -7.6     42,900
275    1350   590    -6.9     49,900
276    1670   1089   -2.6     27,100
277    688    538    -19.4    53,700
278    961    718    -13.0    42,600
279    679    570    -14.5    51,300
281    1648   1084   -0.7     27,300
282    1505   525    -4.6     54,800
283    1313   1147   -7.3     25,100
284    1314   829    -7.3     37,400
285    1332   408    -7.1     67,200
286    1277   652    -7.8     46,100
288    1391   824    -6.3     37,600
289    1147   579    -9.5     50,700
290    925    511    -13.6    55,900
291    787    1476   -16.6    13,900
292    1462   818    -5.1     37,800
293    531    449    -26.3    62,000
294    660    698    -14.9    43,600
295    1162   609    -9.3     48,700
296    218    814    <-35.0   38,000
297    1377   979    -6.5     31,300
299    913    1523   -13.9    12,400
300    2012   667    >0.0     45,300
301    702    178    -19.0    169,200
302    494    1280   -28.1    20,400
303    403    1008   -32.6    30,100
304    1843   1585   -0.7     10,300
305    1049   593    -11.1    49,600
306    1608   969    -3.3     30,900
307    1219   916    -8.5     33,700
308    1627   755    -3.0     40,700
309    1524   892    -4.4     34,700
310    1769   1028   -1.5     29,400
311    1609   1451   -3.3     14,700
312    266    1408   <-35.0   16,100
313    1902   1365   -0.3     17,600
314    1316   1395   -7.3     16,600
315    1341   523    -7.0     54,900
316    1104   1053   -10.1    28,500
320    1480   1459   -4.9     14,400
321    850    603    -15.1    49,100
322    1454   1494   -5.3     13,300
323    670    626    -20.0    47,700
324    655    101    -20.6    420,500
325    1521   675    -4.4     44,800
326    1587   677    -3.6     44,700
327    1388   409    -6.3     67,000
328    448    1291   -30.0    20,100
330    1606   751    -3.3     40,900
331    1566   697    -3.8     43,700
332    531    471    -26.3    59,600
333    784    1156   -16.7    24,700
334    1059   407    -10.9    67,300
335    1593   303    -3.5     88,500
336    1616   598    -3.2     49,400
338    1854   1004   -0.6     30,300
339    1265   888    -8.0     34,900
340    581    585    -23.6    50,300
341    1497   1047   -4.7     28,700
343    1351   265    -6.8     102,200
344    1613   549    -0.9     52,800


345 

346 

347 

348 

349 

350 

351 

352 

353 

354 

355 

356 

357 

358 

359 

360 

361 

362 

363 

364 

365 

366 

367 

368

369 

370 

371 

372

373 

374 

375 

376 

377 

378 

379 

381 

382 

383 

384 

385 

386 

387 

388 

389 

390 

391 

392 

393 

394 

395 

396 

397 

399 
Table 3. Computed pIs of two sets of carbamylated protein standards: rabbit muscle CPK and human
hemoglobin (Hb)
Table 4. Computed pIs of some known proteins related to measured CPK pIs
High Specific Activity Chemiluminescent and
Fluorescent Markers: their Potential 
Application to High Sensitivity and 
'Multi-analyte' Immunoassays 

Roger Ekins*, Frederick Chu and Jacob Micallef

Department of Molecular Endocrinology, University College and Middlesex School of Medicine,
University of London, Mortimer Street, London W1N 8AA, UK

The sensitivities of immunoassays relying on conventional radioisotopic labels (i.e.
radioimmunoassay (RIA) and immunoradiometric assay (IRMA)) permit the measurement of
analyte concentrations above ca. 10^7 molecules/ml. This limitation primarily derives, in the
case of 'competitive' or 'limited reagent' assays, from the 'manipulation' errors arising in the
system combined with the physicochemical characteristics of the particular antibody used;
however, in the case of 'non-competitive' systems, the specific activity of the label may play a
more important constraining role. It is theoretically demonstrable that the development of
assay techniques yielding detection limits significantly lower than 10^7 molecules/ml depends
on:

(1) the adoption of 'non-competitive' assay designs;

(2) the use of labels of higher specific activity than radioisotopes;

(3) efficient discrimination between the products of the immunological reactions involved.

Chemiluminescent and fluorescent substances are capable of yielding higher specific activities 
than commonly used radioisotopes when used as direct reagent labels in this context and both 
thus provide a basis for the development of 'ultra-sensitive', non-competitive, immunoassay
methodologies. Enzymes catalysing chemiluminescent reactions or yielding fluorescent 
reaction products can likewise be used as labels yielding high effective specific activities and 
hence enhanced assay sensitivities. 

A particular advantage of fluorescent labels (albeit one not necessarily confined to them) lies
in the possibility they offer of revealing immunological reactions localized in 'microspots'
distributed on an inert solid support. This opens the way to the development of an entirely new
generation of 'ambient analyte' microspot immunoassays permitting the simultaneous
measurement of tens or even hundreds of different analytes in the same small sample, using
(for example) laser scanning techniques. Early experience suggests that microspot assays with
sensitivities surpassing that of isotopically based methodologies can readily be developed.

Keywords: Ultrasensitive immunoassay; fluorescent microspot immunoassay; confocal microscopy

INTRODUCTION

Immunoassay methods relying on radioisotopic 
labels have played a major role in medicine and 
other biologically related fields (agriculture, 
veterinary science, the food and pharmaceutical 
industries, etc.) during the past two decades. 
Their importance has derived from the exploita- 
tion both of the 'structural specificity' characteriz- 
ing antibody-antigen reactions and the 'detecta- 
bility' of isotopically-labelled reagents, the latter 
permitting observation of the binding reactions 
between exceedingly small concentrations of the 
key reactants involved. The combination of these 
features has endowed radioimmunoassay 
methods with unique specificity and sensitivity 
characteristics, and accounts for their ubiquitous 
use throughout modern medicine and biology. 
However, in the past few years, interest has 
increasingly focused on so-called 'alternative', 
non-radioisotopic, immunoassay methods; such 
techniques are based on essentially identical 
analytical principles but differ in the markers used 
to label the particular immunoreactant (antibody 
or analyte) whose distribution between bound 
and free moieties (following the basic analytical 
reaction) constitutes the assay 'response'. The 
reasons for this interest may be grouped under 
four headings: 



(1) Environmental; logistic; economic; practicality
and convenience, etc. (i.e. 'non-scientific').

(2) The attainment of higher sensitivity. 

(3) The development of 'immunosensors' and 
'immunoprobes'. 

(4) The development of 'multi-analyte' assay 
systems. 

Our own reasons for developing non-isotopic 
techniques fall principally under headings (2) and 
(4), and this presentation will centre primarily on 
the concepts which underlie our immunoassay 
development strategy in these areas. 



THE ATTAINMENT OF 'ULTRA-HIGH'
IMMUNOASSAY SENSITIVITY 

Though, as indicated above, the sensitivity of 
radioisotopically based immunoassay methods 
has constituted one of the principal foundations 
of their widespread use over the past 25 years, a 



fundamental reason for their replacement stems,
paradoxically, from the current requirement to
develop microanalytical techniques which are
superior to them in this particular respect.
Radioisotopic methods are, in practice, limited to
the measurement of analyte concentrations above
about 10^8-10^9 molecules/ml (i.e. approx. 0.15-1.5
pmol/l) (Dakubu et al., 1984). However, in certain
fields (e.g. virology, tumour detection) there is a
particular need to detect or measure molecular
concentrations below this level. The factors which
determine immunoassay sensitivity have been
extensively discussed (Ekins et al., 1968, 1970a;
[citation illegible]; Jackson et al., 1983; Dakubu et al.,
1984; Ekins, 1985). Nevertheless, some of the
underlying concepts are still frequently misunderstood
and merit brief discussion in the present
context.
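
The conversion between the two concentration units quoted above can be checked with a short calculation. The sketch below is illustrative only; the function name is an assumption introduced here, and the only physical input is the Avogadro constant. It gives roughly 0.17 and 1.7 pmol/l for 10^8 and 10^9 molecules/ml, close to the approximate figures quoted in the text.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def molecules_per_ml_to_pmol_per_l(molecules_per_ml: float) -> float:
    """Convert a number concentration (molecules/ml) to pmol/l."""
    molecules_per_l = molecules_per_ml * 1000.0   # 1 l = 1000 ml
    mol_per_l = molecules_per_l / AVOGADRO        # molecules -> moles
    return mol_per_l * 1e12                       # mol -> pmol


if __name__ == "__main__":
    for exponent in (7, 8, 9):
        conc = molecules_per_ml_to_pmol_per_l(10.0 ** exponent)
        print(f"10^{exponent} molecules/ml = {conc:.3g} pmol/l")
```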



The concept of sensitivity 



One major source of past confusion has been 
disagreement regarding the concept of 'sensitiv- 
ity' itself, many authors equating assay sensitivity 
with the slope of the dose-response curve (Yalow
and Berson, 1970a, b; Berson and Yalow, 1973;
see also Ekins et al., 1970b; Tait, 1970). It is now
widely agreed that the notion that a steeper 
dose-response curve implies greater sensitivity is 
erroneous. The invalidity of this belief is clearly 
revealed by the fact that the relative magnitudes 
of the responses yielded by two assay systems is 
dependent on the particular variable which is 
chosen to represent the response (see Fig.
1(a)) (Ekins, 1976). For this and other reasons, it
has long been recognized that the 'sensitivity' of 
an assay can only be satisfactorily represented by 
its lower limit of detection (Fig. 1(b)), and this 
concept is now embodied in all internationally 
agreed definitions of the term. An essentially 
identical definition is as the precision (i.e.
standard deviation) of measurement of zero dose,
since this quantity determines the least quantity
distinguishable from zero and hence the assay 
detection limit. The sensitivity of an assay is thus 
represented by the zero-dose intercept of the 
'precision profile' (Fig. 2(a)) when the latter is 
expressed in terms of standard deviation rather 
than of coefficient of variation (Ekins, 1983a). In 
short, the more sensitive of two assays is the one 
yielding greater precision of the zero dose 
estimate (Fig. 2(b)). 
[Figure 1. Panel (a): F/B plot and B/F plot; panel (b): any plot (response).]

Figure 1. (a) Diagrammatic representation of conventional RIA dose-response curves for systems using high (hi) and low (lo)
antibody concentrations plotted in terms of free/bound (F/B) and bound/free (B/F) labelled antigen. Note that the use of a lower
amount of antibody yields a dose-response curve of greater slope in the F/B plot, but of lower slope in the B/F plot. It is
impossible to decide, on the basis of the data shown in this figure, which concentration of antibody yields the assay system of
higher sensitivity. (b) The sensitivity of an assay is essentially represented by the minimum detectable dose, i.e. the SD of the
dose measurement (SD_dose) at zero dose. This is given by the SD of the response (SD_response) divided by the dose-response
curve slope at zero dose (i.e. (SD_response x dD/dR)_0). This quantity is unaffected by the choice of the coordinate frame used to
plot the dose-response curve. (Note: it is common to multiply (SD_dose)_0 by an arbitrary factor to increase the confidence level
attaching to the minimum detectable dose estimate, though, since no agreement exists regarding the value of this factor, this
unnecessary step merely adds to confusion when the relative sensitivities of two assay procedures are compared.)
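
The statement that (SD_dose)_0 does not depend on the coordinate frame can be illustrated numerically. The sketch below is a toy model and not taken from the paper: the dose-response function `bound_fraction`, its constants, and the assumed SD of the bound fraction are all hypothetical. The same measurement error is propagated into a B/F and an F/B response variable; the two slopes differ several-fold, yet the SD of the dose estimate at zero dose is identical.

```python
def bound_fraction(dose: float) -> float:
    """Hypothetical fraction of labelled antigen bound; falls as dose rises."""
    return 0.3 / (1.0 + dose)


def b_over_f(b: float) -> float:
    """Bound/free response variable."""
    return b / (1.0 - b)


def f_over_b(b: float) -> float:
    """Free/bound response variable."""
    return (1.0 - b) / b


def slope_at_zero(response_of_b, h: float = 1e-6) -> float:
    """Dose-response slope dR/dD at zero dose (forward difference)."""
    return (response_of_b(bound_fraction(h)) - response_of_b(bound_fraction(0.0))) / h


def sd_dose_at_zero(response_of_b, sd_b: float = 0.005, h: float = 1e-6) -> float:
    """(SD_dose)_0 = SD_response / |slope|, both evaluated at zero dose."""
    b0 = bound_fraction(0.0)
    # Propagate the (assumed) SD of the bound fraction into the response variable.
    d_response_db = (response_of_b(b0 + h) - response_of_b(b0 - h)) / (2.0 * h)
    sd_response = abs(d_response_db) * sd_b
    return sd_response / abs(slope_at_zero(response_of_b, h))


if __name__ == "__main__":
    for name, fn in (("B/F", b_over_f), ("F/B", f_over_b)):
        print(name, "slope:", slope_at_zero(fn), "SD of dose at zero:", sd_dose_at_zero(fn))
```

The slope changes with the choice of response variable, but so does the response error, and the two effects cancel in the ratio.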



'Competitive' and 'non-competitive' ('limited
reagent' and 'excess reagent') assays

A second important misconception in this area is 
the notion that immunoassays relying on the use 
of labelled antibodies (e.g. immunoradiometric 
assays, IRMA) are ipso facto more sensitive than 



those which rely on the use of labelled 'analyte' 
(e.g. radioimmunoassays, RIA); furthermore the 
grounds originally advanced for the claimed 
superiority of labelled antibody methods (Miles 
and Hales, 1968) were partially based on false 
concepts of sensitivity, and thus failed to identify 
the true reasons why certain assay designs are 




Figure 2. (a) The 'precision profile' of an assay portrays the error in the dose measurement as a function of dose. The error
may be represented, inter alia, by the absolute error (ΔD; e.g. SD of D) or the relative error (ΔD/D; e.g. CV of D). (ΔD)_0, the
error in the measurement of zero dose, represents the sensitivity of the assay. The working range may be defined as the range
of dose values within which ΔD/D is less than an 'acceptable' value set by the investigator. (b) The more sensitive of the two
assays (assay I) intercepts the ΔD axis at a lower value. However, assay II is more precise at higher values of dose, and has a
wider working range.
potentially capable of yielding far higher sensitivity
than others. This issue likewise merits
clarification. 

The purely pragmatic sub-classification of 
immunoassays into labelled antibody and labelled 
analyte methods diverts attention from a more 
fundamental divide in immunoassay methodolo- 
gy, which relates to the optimal concentration of 
antibody required in an assay system to maximize 
its sensitivity. In certain assay designs (which may 
be termed 'limited reagent' or 'competitive') the 
optimal concentration tends to zero; conversely in 
others (which may be termed 'excess reagent' or 
'non-competitive') the concentration tends to 
infinity. It should be particularly emphasized that 
the optimal antibody concentration is essentially 
governed, not only by the physicochemical characteristics
of the antibody-analyte binding reaction,
but also by the errors incurred in measurement of 
the assay response. Were an assay system to be 
totally error-free, no antibody concentration 
would be optimal, and the distinction between 
competitive and non-competitive methodologies 
would thus not arise. 

Though it is inappropriate in this presentation 
to discuss in detail the statistical and physico- 
chemical theory underlying this fundamental 
divergence in immunoassay design (see Ekins et
al., 1968, 1970a; Jackson et al., 1983), the reason
for it can perhaps be more readily understood if 
the basic principles of immunoassay are portrayed 
in a somewhat different way from that in which 
they are usually presented. All immunoassays 
essentially depend upon measurement of the 
'fractional occupancy' by analyte of antibody 
binding sites following reaction of analyte with 
antibody (see Fig. 3(a)). Those techniques which 
implicitly rely on measurement of residual, 
unoccupied, binding sites optimally necessitate 
the use of concentrations of antibody tending to 
zero, and may be termed 'competitive'; conversely,
those in which occupied sites are directly
measured necessitate use of high antibody concentrations
and are termed 'non-competitive'
(Fig. 3(b)). This emphasizes that the differences 
in assay design characterizing so-called competi- 
tive and non-competitive methods are essentially 
unrelated to which component (if any) of the 
reaction system is labelled. Indeed immunoassays 
in which no label of any kind is involved can, on 
identical grounds, be subdivided into those of 
'limited reagent' (or 'competitive') and 'excess
reagent' (or 'non-competitive') design. Thus the



distinction between these two forms of im- 
munoassay simply reflects differences in the way 
that fractional antibody occupancy is determined, 
and the fact that it is generally undesirable — for 
reasons of accuracy — to measure a small quantity 
by estimating the difference between two large 
quantities. When an immunoassay relies on the 
measurement of unoccupied antibody binding 
sites, the total amount of antibody used in the 
system must be small to minimize error in the 
resulting (indirect) estimate of occupied sites. 
Figure 3. The distinction between 'non-competitive' (above)
and 'competitive' immunoassays (below) reflects how
antibody binding-site occupancy is measured. Labelled
antibody methods are 'non-competitive' if occupied sites of
the (labelled) antibody are measured, but are 'competitive'
(below right) when unoccupied sites are measured. Labelled
antigen (below left) or labelled anti-idiotypic antibody
methods (below centre) rely on measurement of sites
unoccupied by analyte, and are therefore invariably of
'competitive' design.
Figure 4. Curves showing the theoretically predicted relationship between antibody affinity and the sensitivities achievable
using 'competitive' and 'non-competitive' assay strategies. The 'potential' sensitivity curves assume the use of infinite specific
activity labels; the sensitivities achievable using 125I-labelled antigen or antibody are also shown. Shaded areas indicate the
sensitivity loss due to errors in measurement of the label. Curves relating to 'competitive' assays assume a 1% error in
measurement of the response variable arising from 'experimental' errors (i.e. errors other than those inherent in label
measurement per se). Non-competitive curves assume 'non-specific binding' of labelled antibody of 0.01% and 1% (lower and
upper curves) respectively. Arrows indicate sensitivities claimed for typical non-competitive immunoassay methodologies.



Conversely, when occupied sites are measured 
directly, this particular constraint does not arise; 
indeed, considerable advantage often derives 
from using relatively large amounts of antibody in 
the system. 
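
The accuracy argument (a small quantity estimated as the difference between two large ones) can be made concrete with a few lines of arithmetic. This is a simplified sketch under stated assumptions, not the authors' error model: the total number of sites is taken as known exactly, only the unoccupied sites are measured, and the 1% relative measurement error is a hypothetical figure.

```python
def relative_error_of_occupied(total_sites: float,
                               occupied_sites: float,
                               measurement_cv: float = 0.01) -> float:
    """Relative error of an occupied-site estimate obtained by difference.

    occupied = total - unoccupied, where only the unoccupied sites are
    measured, with a fixed relative error (measurement_cv).
    """
    unoccupied = total_sites - occupied_sites
    sd_unoccupied = measurement_cv * unoccupied   # absolute error of the measurement
    return sd_unoccupied / occupied_sites         # same absolute error, relative to the small quantity


if __name__ == "__main__":
    occupied = 1.0  # arbitrary units
    for total in (2.0, 11.0, 101.0, 1001.0):
        err = relative_error_of_occupied(total, occupied)
        print(f"total sites = {total:7.0f}  ->  relative error of occupied sites = {err:.0%}")
```

With these numbers a 1% measurement error grows to 100% of the occupied-site estimate once the antibody is in hundred-fold excess, which is why the indirect ('competitive') approach favours small amounts of antibody, whereas a direct measurement of occupied sites carries no such penalty.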



Sensitivity of 'competitive' and 
'non-competitive' immunoassays 

Competitive and non-competitive immunoassays 
differ significantly in many of their performance 
characteristics in consequence of the differences 
in optimal antibody concentration on which they 
rely. Most particularly they differ in their 
potential sensitivities. Figure 4 portrays the
sensitivities predicted theoretically as a function 
of antibody binding affinity, making realistic 
assumptions regarding the experimental errors 
incurred in reagent manipulation, 'non-specific' 
binding of labelled antibody, etc., and assuming 
the use of optimal reagent concentrations (Ekins, 
1985). Amongst other concepts illustrated in the 
figure is the much greater assay sensitivity 
potentially attainable (using an antibody of given 
affinity) by adoption of a non-competitive 
approach. In short, whereas the maximal sensitivity
realistically achievable using a competitive
design is in the order of 10^7 molecules/ml (using
antibody of the highest affinity found in practice),
a non-competitive method is capable of yielding
sensitivities some orders of magnitude greater
than this. However, Fig. 4 also demonstrates that,
assuming the use of high affinity antibodies (i.e.
~10^11-10^12 L/M), maximal sensitivities yielded by
isotopically based techniques (whether relying on
labelled antibody (IRMA) or labelled analyte
(RIA), or whether of competitive or non-competitive
design) are closely comparable, i.e.
of the order of 10^7-10^8 molecules/ml.

This limitation is a manifestation of the fact 
that, in the case of the non-competitive methods, 
an important constraint on assay sensitivity is 
(under certain circumstances) the 'specific activ- 
ity' of the label used. On the other hand, 
limitation of assay sensitivity due to the low 
specific activity of radioisotopic labels does not 
often arise, in practice, in the case of competitive 
assays, whose sensitivity is generally restricted by 
other factors (Ekins, 1985). The fundamental 
significance of this conclusion is that, only by the 
use of labels possessing specific activities higher 
than those of the commonly used radioisotopes in 
assays of non-competitive design, can current 
sensitivity limits be breached. Conversely, use of
a higher specific activity label in a competitive
assay will usually have no significant effect on its
sensitivity (assuming experimental errors incurred
in reagent manipulation of the magnitude
generally encountered in practice).



High specific activity non-isotopic labels 

The term 'specific activity' is conventionally 
applied, in the case of radioisotopic labels, to 
denote the number of radioactive disintegrations 
per unit time per unit weight of the isotope or 
labelled compound. In the present context, use of 
the term is widened to signify 'detectable events' 
per unit time per unit weight of labelled material. 
Thus it can be used to indicate the rate of photon 
emission by a chemiluminescent or fluorescent 
label, or the rate of conversion of substrate
molecules, by an enzyme label, to molecules of
a detectable product. The importance of the
concept derives from the fact that 'signal
measurement error' (i.e. error in the measurement
of the label per se) is a contributory factor in
limiting assay sensitivity, and may, when other
sensitivity-constraining factors are reduced,
become dominant. Furthermore, when extending
the sensitivities of immunoassay systems beyond 
their present limits, the numbers of molecules 
involved are low, and statistical errors incurred in 
counting individual 'detectable events', and the 
time required to count them, may assume a 
particular importance. 
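
The link between specific activity, counting statistics and counting time can be sketched as follows. The example assumes pure Poisson counting (CV = 1/sqrt(counts)), 100% detection efficiency and no background, none of which is stated in the text; the function name and the choice of 10^6 labelled molecules are illustrative. The 125I figure of one event per second per 7.5 x 10^6 molecules is the value listed in Table 1.

```python
def counting_time_for_cv(labelled_molecules: float,
                         events_per_sec_per_molecule: float,
                         target_cv: float) -> float:
    """Seconds of counting needed for the Poisson CV of the count to reach target_cv."""
    required_counts = 1.0 / target_cv ** 2                        # CV = 1/sqrt(N) for Poisson counts
    count_rate = labelled_molecules * events_per_sec_per_molecule  # detectable events per second
    return required_counts / count_rate


if __name__ == "__main__":
    iodine_125 = 1.0 / 7.5e6   # events/sec/molecule, from Table 1
    one_per_molecule = 1.0     # hypothetical label giving one event/sec/molecule
    for name, activity in (("125I", iodine_125), ("1 event/sec/molecule", one_per_molecule)):
        seconds = counting_time_for_cv(1e6, activity, target_cv=0.10)
        print(f"{name}: {seconds:.3g} s to count 10^6 labelled molecules to 10% CV")
```

Under these assumptions 10^6 molecules labelled with 125I need about 750 s of counting for a 10% counting error, whereas a label of much higher specific activity reaches the same precision almost immediately.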

Table 1 compares the specific activities of 
potentially useful labels with that of 125I. All are
of relevance in the context of this volume since 
chemiluminescent and fluorescent labels can be 
used to label antibodies (or antigens) directly; 
alternatively, enzyme labels catalysing reactions 
yielding chemiluminescent signals or fluorescent 
products can be utilized. 
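
The radioisotope entries in Table 1 follow directly from the decay law: a nuclide of half-life T has decay constant ln 2 / T, so one disintegration per second requires T / ln 2 atoms. The sketch below assumes one radioactive atom per labelled molecule and uses commonly quoted half-lives (about 59.4 days for 125I and 12.3 years for 3H), which are not given in the text.

```python
import math

SECONDS_PER_DAY = 86_400.0
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def molecules_per_event_per_sec(half_life_seconds: float) -> float:
    """Number of singly labelled molecules that together give one decay per second."""
    decay_constant = math.log(2) / half_life_seconds  # decays per second per atom
    return 1.0 / decay_constant


if __name__ == "__main__":
    iodine_125 = molecules_per_event_per_sec(59.4 * SECONDS_PER_DAY)   # assumed half-life
    tritium = molecules_per_event_per_sec(12.3 * SECONDS_PER_YEAR)     # assumed half-life
    print(f"125I: 1 event/sec per {iodine_125:.2g} labelled molecules")
    print(f"3H:   1 event/sec per {tritium:.2g} labelled molecules")
```

The results, roughly 7.4 x 10^6 and 5.6 x 10^8 molecules, agree with the tabulated values of 7.5 x 10^6 and 5.6 x 10^8.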



The importance of background in 
non-competitive immunoassays 

A second important factor governing the sensitiv- 
ity of non-competitive labelled-antibody im- 
munoassays is the 'background' or 'blank' signal 
emitted in the absence of analyte, since error in 
the measurement of this signal is clearly a major 
determinant of the error in measurement of zero 



Table 1. Relative specific activities of various
isotopic and non-isotopic labels. Note that, though
the specific activity of 125I-labelled reagents does
not, in practice, significantly limit the sensitivity of
competitive assays (see Fig. 4), the lower specific
activity of 3H may severely restrict the sensitivity
of competitive assays (e.g. of steroid hormones)
which rely on the use of this particular radioisotope.

Specific Activities

125I:                      1 detectable event/sec/7.5 x 10^6 labelled molecules.
3H:                        1 detectable event/sec/5.6 x 10^8 labelled molecules.
Enzymes:                   Determined by enzyme 'amplification factor' and detectability of reaction product.
Chemiluminescent labels:   1 detectable event/labelled molecule.
Fluorescent labels:        Many detectable events/labelled molecule.



dose. Amongst contributors to the background 
signal are the 'noise' of the measuring instrument 
itself, 'ambient' signal generators (such as, in 
'sandwich' immunoassays, solid 'capture- 
antibody' supports or, in the case of radioisotopic 
methods, cosmic ray and other extraneous radia- 
tion sources) and 'non-specifically bound' label- 
led antibody. Minimization of each of these 
components is essential for maximal sensitivity: 
mere arithmetic subtraction of background is of 
absolutely no benefit in this context. 

Non-specific binding of antibody is of particular 
interest, since the magnitude of this contribution 
is dependent, inter alia, on the amount of labelled 
antibody used in the system, and the duration of 
its exposure to analyte. Thus increasing the 
amount of labelled antibody increases the amount 
of such antibody bound to analyte; however, it 
may also increase the non-specifically bound 
moiety to a greater proportional extent, and thus 
cause a net reduction in sensitivity. This effect 
underlies the loss in sensitivity at higher antibody 
concentrations depicted in Fig. 5 (reproduced 
from Jackson et al., 1983). This phenomenon also 
underlies the relationship between sensitivity and 
the affinity constant of the labelled antibody 
depicted in Fig. 4. The possession by labelled 
antibody of a high affinity constant implies that a 

Figure 5. Assay sensitivity (represented by the standard 
deviation of the zero dose measurement, $\sigma_0$), plotted as a 
function of the concentration of labelled antibody (of affinity 
$10^{11}$ L/M) used in the assay, assuming different levels of 
non-specific binding of labelled antibody. (Note: an irreducible 
instrument background has been assumed in the computa- 
tions represented; this limits the ultimate sensitivity attain- 
able, regardless of the concentration of antibody used.) 



lower concentration is required to yield the same 
level of analyte binding, albeit with reduced 
non-specific binding, thus increasing assay 
sensitivity. 

In summary, the high sensitivity of non- 
competitive labelled antibody methods derives 
essentially from their permitted use of optimal 
concentrations of antibody which (provided non- 
specific binding of labelled antibody is low) 
are generally considerably greater than in com- 
petitive methods, not from the fact that the 
antibody is labelled. Labelled antibody methods 
generally fall in sensitivity as the concentration of 
antibody is reduced towards zero, ultimately 
yielding a sensitivity theoretically identical to that 
of competitive methods (Rodbard and Weiss, 
1973). (Paradoxically, early exponents of labelled 
antibody methods, whilst claiming them to be of 
higher sensitivity, also concluded that their 
sensitivity was increased by reduction in the 
amount of labelled antibody used (Woodhead et 
al., 1971). This incorrect conclusion, based on 
observation of effects on the slope of the 
dose-response curve, exemplifies the many 
fallacies encountered in the immunoassay field 
stemming from confusion regarding the concept of 
sensitivity discussed above.) Finally it should be 



emphasized that maximization of the sensitivity of 
a non-competitive immunoassay generally implies 
the selection of reagent concentrations and other 
experimental conditions such that the [analyte 
signal/background] ratio (i.e. s/b) is maximized. 
However, this simple relationship disregards 
statistical considerations which arise when the 
numbers of detectable events are very low, and a 
more appropriate objective may, under these 
circumstances, be maximization of the ratio $s^2/b$ 
(Loevinger and Berman, 1951). 



Other performance characteristics of 
competitive and non-competitive 
immunoassays 

Non-competitive designs also display a number of 
other advantages deriving from the relatively high 
antibody concentrations on which they generally 
rely. These include increased reaction speeds 
(and hence shorter incubation times), decreased 
vulnerability to certain environmental effects 
(which cause variations in binding affinity be- 
tween antibody and analyte), reduced sensitivity- 
dependence on high antibody binding affinity, 
etc. 

Nevertheless a price has to be paid for these 
benefits; this includes the greater tendency of a 
large amount of antibody to bind molecules 
differing from, but with structural resemblance 
to, the analyte itself, implying a loss of assay 
specificity. This effect generally necessitates the 
use, whenever possible, of an 'immunoextraction' 
procedure using a second 'capture' antibody 
(usually directed against a different binding site, 
or 'epitope') as shown in Fig. 3(b). This 
technique, the 'sandwich' or 'two-site' 
immunoassay (Wide, 1971), thus potentially 
combines the twin virtues of ultra-high sensitivity and 
specificity (together with short reaction time), 
features of crucial importance in many diagnostic 
situations (for example, in the detection of AIDS 
viral antigens). (Note, however, that the loss of 
specificity inherent in non-competitive assay 
designs implies that they are less readily applic- 
able to the measurement of analytes of small 
molecular size, which cannot be simultaneously 
bound by two different antibodies directed 
against different antigenic sites on the molecule. 
Such analytes are generally more appropriately 
measured using 'competitive' assay methods.) 



Development of ultra-sensitive 
immunoassay methodologies 

The perception that the development of 'ultra- 
sensitive' immunoassay systems (i.e. systems 
surpassing conventional RIA methods in sensitiv- 
ity) depends on (a) reliance on 'excess reagent' or 
'non-competitive' assay designs; (b) the use of 
non-isotopic labels displaying higher specific 
activities than commonly used radioisotopes; (c) 
the development of efficient separation systems 
(ensuring minimization of non-specific antibody 
binding, and hence of signal 'backgrounds'), and 
(d) dual or multi-antibody analyte-recognition 
systems (exemplified by 'sandwich' or two-site 
assays) to maintain/increase assay specificity, has 
formed the basis of our own laboratory's im- 
munoassay development since the early to mid- 
1970s (Ekins, 1978). This led us, inter alia, to an 
immediate recognition (Ekins, 1979, 1980) of the 
importance of the in vitro techniques of mono- 
clonal antibody production pioneered by Kohler 
and Milstein (1975), which are currently the 
subject of bitter patent disputes in the USA 
(Ezzell, 1986, 1987a,b), and which may be 
expected in Europe. 

Meanwhile, of the candidate labels for use in 
this context, both chemiluminescent and fluores- 
cent labels offer many attractions. The develop- 
ment of stable, highly chemiluminescent, acridi- 
nium esters by McCapra and his colleagues 
(McCapra et al., 1977) has subsequently been 
exploited by Weeks et al. (1983, 1984) and, more 
recently, by several commercial kit manufactur- 
ers; other workers have used more conventional 
chemiluminescent compounds to label immuno- 
assay reagents (see, for example, Kohen et al., 
1984, 1985; Barnard et al., 1985). Yet others have 
relied on enzyme labels to catalyse chemilumi- 
nogenic (Whitehead et al., 1983) and fluorogenic 
(Shalev et al., 1980) reactions as indicated above. 
Detailed description of these various methodolo- 
gies is presented by others in this volume and 
need not be duplicated here. 

Common to all the 'ultra-sensitive' immuno- 
assay methodologies relying on such alternative 
labels is their dependence on a non-competitive, 
labelled antibody, assay strategy whenever 
appropriate; however, for the reasons indicated 
above, competitive methods continue to be 
generally employed for the measurement of 
analytes of small molecular size (e.g. therapeutic 
drugs, steroid and thyroid hormones, etc.). 



Nevertheless, the convenience (from a manufac- 
turing viewpoint, and for other technical reasons) 
of relying on standard labelling procedures has 
meant that, even in these cases, labelled antibody 
techniques are increasingly preferred. Though the 
commercial kits based on these various labels 
differ to a minor extent in sensitivity, specificity, 
convenience, etc., such differences are at least 
partially attributable to differences in the physi- 
cochemical characteristics of the antibodies used 
in the kits, and to other 'immunological' factors 
unconnected with the particular nature of the 
label per se. 

Despite the obvious attractions of chemi- 
luminescent techniques in an immunoassay con- 
text, the use of fluorescent labels combined with 
sophisticated time-resolution techniques for their 
detection (a concept arising from discussions with 
J. F. Tait in 1970) appeared to us (in the 
mid-1970s) to offer more exciting long-term 
possibilities for a number of reasons. These 
naturally included attainment of the enhanced 
specific activities and high signal to background 
ratios required for ultra-sensitive immunoassay as 
indicated above. However, more importantly, 
fluorescence techniques also appeared to provide 
a simple route to the development of 'multi- 
analyte' assay systems of the kind described 
below. 

In pursuance of this strategy, we began 
collaboration with LKB/Wallac, ca. 1976-77, in 
the development of the instrumentation and 
technology required to develop such methods. 
Fortunately a group of fluorescent substances 
generally known as the lanthanide chelates 
(including, in particular, the chelates of euro- 
pium, samarium and terbium) facilitate such 
development, possessing prolonged fluorescence 
decay times (~10-1000 µs), large Stokes shift 
(~300 nm) and other desirable physical character- 
istics which permit the construction of relatively 
cheap instrumentation for their measurement 
(Marshall et al., 1981; Hemmila et al., 1983). The 
fluorescent properties of the lanthanide chelates 
may be compared with those of a conventional 
fluorophor such as fluorescein which is characte- 
rized by a much smaller Stokes shift (~28 nm), 
and a fluorescent decay time and emission 
spectrum which imply that it is less readily 
distinguished from fluorescent substances present 
in blood (such as bilirubin) or in plastic sample 
holders. The unique fluorescence characteristics 
of the lanthanide chelates thus permit them to be 
measured in the presence of a fluorescence 
background (deriving from extraneous sources) 
which, in practice, approaches zero. Fig. 6 
illustrates the basic concepts involved in pulsed- 
light, time-resolved, fluorescence measurement, 
which form the basis of the DELFIA immunoas- 
say system currently marketed by LKB/Wallac. 

Though it is inappropriate to pursue this 
subject in greater detail, attention should also be 
drawn to the possibilities offered by phase- 
resolved fluorimetry. This permits separate iden- 
tification of fluorophores differing in fluorescence 
lifetime by their exposure to a sinusoidally 
modulated exciting light source, and observation 
of their demodulated, phase-shifted, light emis- 
sion (McGown and Bright, 1984). This technique 
offers the possibility both of the development of 
homogeneous assays (relying on a difference in 
fluorescence decay time of bound and free forms 
of the fluorescent-labelled molecule), and of 
discriminating between two labelled antibodies in 
the context of multi-analyte 'ratiometric' im- 
munoassay as discussed below. 



'AMBIENT ANALYTE' IMMUNOASSAY 

Before proceeding to a discussion of the develop- 
ment of multi-analyte assays, another important 
concept, termed 'ambient analyte immunoassay' 
(Ekins, 1983b), must first be examined. This 
term is intended to describe a type of immuno- 
assay system which, unlike conventional 



methods, measures the analyte concentration in 
the medium to which an antibody is exposed, 
being essentially independent both of sample 
volume, and of the amount of antibody present. 
This concept is illustrated in Fig. 7, and relies on 
the physicochemically-based proposition that, 
when a 'vanishingly small' amount of antibody 
(preferably, but not essentially, coupled to a solid 
support) is exposed to an analyte-containing 
medium, the resulting (fractional) occupancy of 
antibody binding sites solely reflects the ambient 
analyte concentration. Clearly the binding by 
antibody of analyte results in a depletion of the 
amount of analyte in the surrounding medium, 
but provided the proportion so bound is small 
(i.e. less than, for example, 1% of the total), such 
disturbance can be ignored. (This effect is closely 
analogous to that caused by the introduction of a 
thermometer into a medium possessing a much 
larger thermal capacity; the temperature disturb- 
ance caused by the thermometer itself is negligi- 
ble and can, in these circumstances, be disre- 
garded.) 

The principles of ambient analyte assay derive 
from the recognition that all immunoassays 
essentially depend upon measurement of the 
'fractional occupancy' by analyte of antibody 
binding sites following reaction of analyte with 
antibody as discussed above (Figs 3(a) and (b)). 
The fractional occupancy of ('monospecific' or 
'monoclonal') antibody binding sites in the 
presence of varying analyte concentrations, plotted 
against antibody concentration, is portrayed 
in Fig. 8. The fraction of analyte bound is also 
plotted in this figure. (Note: for the sake of 
generality, all concentrations in this figure are 
expressed in terms of $1/K$, where $K$ is the affinity 
constant of the antibody. For example, if $K = 
10^{11}$ L/M, a concentration of $0.1 \times 1/K$ represents 
$0.1 \times 10^{-11}$ M/L, or $0.1 \times 10^{-11} \times 10^{-3} \times 6.02 \times 
10^{23} = 6.02 \times 10^{8}$ molecules/ml.) 
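
The unit conversion in the note above can be checked numerically. The following is a minimal Python sketch; the function name `conc_in_molecules_per_ml` is illustrative only and does not come from the original text.

```python
AVOGADRO = 6.02e23  # molecules per mole, the value used in the note above


def conc_in_molecules_per_ml(conc_in_units_of_1_over_K, K_litres_per_mole):
    """Convert a concentration expressed in units of 1/K to molecules/ml."""
    molar = conc_in_units_of_1_over_K / K_litres_per_mole  # mol/L
    return molar * 1e-3 * AVOGADRO                          # 1 L = 1000 ml


if __name__ == "__main__":
    # 0.1 x 1/K for an antibody of affinity 1e11 L/M
    print(conc_in_molecules_per_ml(0.1, 1e11))
```

Expressing concentrations in units of 1/K keeps Fig. 8 general; a conversion of this kind is needed only when a specific antibody affinity is assumed.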

It should be particularly noted that, at antibody 
concentrations of less than approximately $0.01 \times 1/K$, antibody 
fractional occupancy is essentially dependent 
solely on the analyte concentration in the 
medium, and is independent of variations in 
antibody concentration. This reflects the fact that 
this concentration of antibody binds less than 
approximately 1% of the analyte in the medium, 
irrespective of its concentration. This implies, for 
example, that the introduction of 10, 100, or 1000 
antibody molecules into a medium containing 
billions of analyte molecules will result, in each 
case, in virtually identical fractional antibody 
binding-site occupancy, the upper limit of anti- 
body concentration being determined by the 
antibody affinity constant. (An antibody concen- 
tration of $0.01 \times 1/K$ is a hundred-fold less than 



that (1 x 1/K) necessary to bind 50% of a 'trace' 
amount of analyte (see Fig. 8), claimed by Berson 
and Yalow (1973) as maximizing assay 'sensitiv- 
ity' (i.e. the slope of the dose-response curve 
when expressed in terms of bound/free labelled 
analyte). This false conclusion has subsequently 
become incorporated into the mythology of 
radioimmunoassay design which, regrettably, a 
majority of kit manufacturers continue to accept.) 

The ambient analyte assay concept was first 
exploited in the development of 
what has come to be known as 'two-step' free 
hormone immunoassay (Ekins et al., 1980), but it 
is clear that it is of far wider application, and can, 
in particular, be utilized in the construction of 
immunosensors and immunoprobes. One such 
example is a probe for the measurement of 
salivary steroids that is currently being developed 
in our laboratory. Comprising a small antibody- 
coated plastic 'dipstick' comparable in size and 
shape to a clinical thermometer, this device is 
intended to permit the measurement of salivary 
steroid levels without requiring the collection of 
saliva. However, the concept also underlies our 
approach to multi-analyte immunoassay, also 
under development in our laboratory. 



MULTI-ANALYTE 'RATIOMETRIC' 
IMMUNOASSAY SYSTEMS 

The concepts relating to ambient analyte im- 
munoassay and assay sensitivity outlined above 
are both exploited in our present development of 
a random access, multi-analyte, immunoassay 
technology capable of measuring, in the same 
small sample, virtually any number of individual 
analytes from selected analyte 'menus' (e.g. a 
hormone menu, viral antigen menu, an allergen 
menu, etc.). Many examples of a need to measure 
a multiplicity of different analytes in the same 
sample exist in medical diagnosis, for example, in 
the routine diagnosis of thyroid disease, where it 
is frequently necessary to measure a number of 
different hormones and thyroid-related proteins. 
At present, clinicians frequently experience diffi- 
culty in deciding on the best sequence of tests to 
arrive at a correct diagnosis. Such problems 
would be overcome were all relevant analytes 
measurable at a cost comparable to the cost of 
measurement of a single substance. Our own 
immediate objective is the development of a 
technology permitting the measurement of com- 
plete 'hormone profiles' using a single small blood 
sample. However, the need for 'multi-analyte', or 
'random access' measurement is not confined to 
medical diagnosis: it also arises, for example, in 
the pharmaceutical industry (where there exists a 
requirement to ensure the purity of protein drugs 
synthesized by recombinant DNA techniques), in 
the food industry and elsewhere. Though still at 
an early stage, our approach to the achievement 
of this objective can be briefly indicated. 



Multi-analyte assay: general principles 

As discussed above, the notion of ambient 
analyte assay simultaneously introduces two 
extremely important and novel concepts: (a) that 
an estimate of analyte concentration can be based 
upon the use of an infinitesimal amount of 
'sampling' antibody, and (b) that such an estimate 
derives from a direct measurement of fractional 
antibody occupancy by analyte, irrespective of 
the exact amount of antibody used. It should be 
emphasized that the latter proposition is valid 
only in the context of ambient analyte assay (in 
other circumstances, fractional occupancy 
depends both upon the amount of antibody in the 
system, and sample volume; see Fig. 8). In short, 
exposure of a small number of antibody mole- 
cules (in the form, for example, of a 'microspot' 
located on a solid support) to an analyte- 
containing fluid results in occupancy of antibody 
binding sites in the microspot reflecting the 
analyte concentration in the medium. Following 
such exposure, the antibody-bearing probe may 
be removed and exposed to a 'developing' 
solution containing a high concentration of an 
appropriate second antibody directed against 
either a second epitope on the analyte molecule if 
this is large (i.e. the occupied site), or against 
unoccupied antibody binding sites in the case of 
small analyte molecules (see Fig. 3(b)). (Note: an 
antibody simulating antigen, and reacting with 
unoccupied binding sites, is described as a 
'mirror-image anti-idiotypic antibody'; the use of 
such an antibody instead of labelled antigen is 
convenient but not essential, and is suggested 
here merely to simplify illustration of the basic 
concepts involved.) 

Subsequently, an estimate of binding-site occu- 
pancy of the 'sampling' (solid phase) antibody 
located in the microspot may be derived by 
measurement of the ratio of signals emitted by the 
two antibodies forming the dual-antibody 'coup- 
lets'. This can be conveniently achieved by 
labelling the 'sampling' and 'developing' anti- 
bodies with different labels, for example, a pair of 
radioactive, enzyme or chemiluminescent mar- 
kers. Fluorescent labels are nevertheless particu- 
larly useful in this context because, by the use of 
optical scanning techniques, they permit arrays of 
different antibody 'microspots' distributed over a 
surface, each directed against a different analyte, 
to be individually examined, thus enabling 
multiple assays to be simultaneously carried out 
on the same small sample. Fig. 9 illustrates these 
basic ideas, and Fig. 10 such an array. 



Microspot immunoassay sensitivity: 
theoretical considerations 

The notion that it is, in principle, possible to 
measure an analyte concentration using a micros- 
pot of antibody comprising a number of antibody 
molecules in the range ca lO'-lO* is likely, at first 
sight, to appear surprising, and may, indeed, 

Clearly a number of factors, such as the sensitivity 
of the signal measuring equipment, the density of 
antibody molecules on the surface of the solid 
support, etc., are likely to play a part in 
determining final assay sensitivity. Such factors 
are, in turn, dependent on the efficiency with 
which the particular labels used can be detected, 
the adsorption properties of antibody supports, 

Figure 10. 'Multi-analyte' antibody array. Each antibody 
'microspot' represents a 'vanishingly small' amount of 
antibody directed against an individual analyte. 



etc. Though these are obviously variable, reason- 
able estimates can be made of the order of 
sensitivities likely to be achieved on the basis of 
some simple theoretical calculations. To clarify 
the following discussion, it is assumed that 
'sensing' antibody can be uniformly and consis- 
tently coated on a solid matrix at a standard 
density, implying that only the 'developing' 
antibody need be labelled and measured in order 
to ascertain fractional occupancy of sensing 
antibody binding sites. 

Fig. 11 illustrates the surface of an antibody 
microspot, of surface area $A$ (µm$^2$), and (uniform- 
ly) coated with antibody of affinity $K$ (L/M) in a 
monomolecular layer of density $D$ (molecules/ 
µm$^2$). Let us assume that the spot is exposed to an 
analyte-containing medium of volume $v$ (ml), and 
containing an analyte concentration $C$ molecules/ 
ml. The molecular concentration of antibody in 
the system is thus given by $AD/v$. (Note: the fact 
that antibody is situated on the surface of a solid 
support, and not evenly distributed throughout 
the medium, does not affect the extent of analyte 
binding at thermodynamic equilibrium, assuming 
that antibody binding sites are not impeded in 
their reactions and have not been damaged during 
the coating process.) 

Figure 11. Microspot ambient-analyte immunoassay. The microspot shown is assumed to be uniformly coated with antibody, 
though if the dual-labelled antibody 'ratiometric' approach shown in Fig. 9 is adopted, uniform coating is not essential. The 
minimum fluid volume for ambient analyte assay conditions to prevail (enabling adoption of the ratiometric approach) is shown. 
Minimum test sample volume (ml): $A \times D \times K \times 10^{5}/N$ 

Meanwhile, fractional occupancy ($F$) of antibody 
binding sites by analyte at equilibrium is 
given by the equation: 

$$F^2 - F\left(\frac{1}{q} + \frac{p}{q} + 1\right) + \frac{p}{q} = 0 \tag{1}$$ 

where $p$ = analyte concentration, $q$ = antibody 
concentration (both expressed in units of $1/K$). 

Thus, for antibody binding site concentrations 
$\to 0$ (i.e. $q < 0.01$), $F \approx p/(1 + p)$ (see Fig. 8). 

Likewise, the fraction of analyte bound by 
antibody ($f$) at equilibrium is given by the equation: 

$$f^2 - f\left(\frac{1}{p} + \frac{q}{p} + 1\right) + \frac{q}{p} = 0 \tag{2}$$ 

Thus, for analyte concentration $\to 0$ (i.e. $p < 
0.01$), $f \approx q/(1 + q)$ (see Fig. 8). Furthermore, 
when $q < 0.01$, and for any $p \geq 0$, $f < 0.01$. 
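
The behaviour of Eqs (1) and (2) can be explored numerically. The following is a minimal Python sketch, not part of the original analysis; the function names are illustrative, and the physically meaningful root is taken to be the smaller one (the root lying between 0 and 1).

```python
import math


def fractional_occupancy(p, q):
    """Smaller root of Eq. (1): F^2 - F(1/q + p/q + 1) + p/q = 0.

    p and q are the analyte and antibody concentrations in units of 1/K.
    Multiplying Eq. (1) by q gives q*F^2 - (1 + p + q)*F + p = 0, which
    avoids dividing by a very small q.
    """
    b = 1.0 + p + q
    disc = b * b - 4.0 * p * q
    # Algebraically equal to (b - sqrt(disc)) / (2q), but stable as q -> 0
    return 2.0 * p / (b + math.sqrt(disc))


def fraction_analyte_bound(p, q):
    """Root of Eq. (2), which is Eq. (1) with p and q interchanged."""
    return fractional_occupancy(q, p)


if __name__ == "__main__":
    p = 1.0  # analyte concentration equal to 1/K
    for q in (1e-4, 1e-3, 1e-2, 1.0):
        print(q, fractional_occupancy(p, q), fraction_analyte_bound(p, q))
```

For q below about 0.01 the computed occupancy stays close to p/(1 + p) and the fraction of analyte bound stays below 1%, which is the 'ambient analyte' regime described in the text.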

Expressed in units of $1/K$, the concentration ($q$) 
in the assay of 'sensing' antibody situated on the 
microspot is given by $DAK/(v \times 6 \times 10^{20})$ (since 
Avogadro's constant, expressed as the number of 
molecules/mmol, is approximately $6 \times 10^{20}$). 
The fraction of an analyte concentration $\to 0$ 
which will be bound to the spot is therefore 
$DAK/(v \times 6 \times 10^{20} + DAK)$, implying that the 
number of analyte molecules bound to the spot is 
given by $vCDAK/(v \times 6 \times 10^{20} + DAK)$. 
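
These expressions are straightforward to evaluate. The sketch below is illustrative only: the function names, the spot area of 100 µm$^2$, the sample volume of 1 ml and the analyte concentration are assumptions chosen for the example, while the density and affinity are the values used in the worked examples later in the text.

```python
MOLECULES_PER_MMOL = 6e20  # Avogadro's constant per mmol, as rounded in the text


def sensing_antibody_q(D, A, K, v):
    """Sensing-antibody concentration in units of 1/K: DAK / (v * 6e20).

    D in molecules/um^2, A in um^2, K in L/M, v in ml.
    """
    return D * A * K / (v * MOLECULES_PER_MMOL)


def analyte_molecules_bound(C, D, A, K, v):
    """Molecules bound to the spot as analyte concentration -> 0."""
    dak = D * A * K
    return v * C * dak / (v * MOLECULES_PER_MMOL + dak)


def min_sample_volume_ml(D, A, K, q_max=0.01):
    """Smallest volume for which q <= q_max (ambient analyte conditions)."""
    return D * A * K / (q_max * MOLECULES_PER_MMOL)


if __name__ == "__main__":
    D, K = 1e5, 1e11            # density and affinity used in the text's examples
    A, v, C = 100.0, 1.0, 1e6   # hypothetical area (um^2), volume (ml), molecules/ml
    print("q =", sensing_antibody_q(D, A, K, v))
    print("bound =", analyte_molecules_bound(C, D, A, K, v))
    print("min volume (ml) =", min_sample_volume_ml(D, A, K))
```

With q_max = 0.01 the last function reduces to A x D x K x 10^5 divided by Avogadro's number per mole, which matches the minimum-volume expression quoted in the caption to Fig. 11.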



Case 1: sandwich (two-site) assay. Following 
incubation of sample with antibody, we assume 
the sample is removed, and the microspot then 
exposed to a volume $V$ (ml) of a solution of a 
labelled 'developing' antibody of affinity 
$K^*$ (L/M) at a concentration given by $Q$ 
(expressed in units of $1/K^*$). 



The fraction of analyte bound by labelled 
antibody ($F^*$) at equilibrium is given by the 
equation: 

$$F^{*2} - F^*\left(\frac{1}{P} + \frac{Q}{P} + 1\right) + \frac{Q}{P} = 0 \tag{3}$$ 

where $P$ represents the analyte concentration in 
the developing-antibody solution, expressed in 
units of $1/K^*$, i.e. $vCDAKK^*/[(v \times 6 \times 10^{20} + 
DAK)V \times 6 \times 10^{20}]$. 

Assuming $P < 0.01$, $F^* = Q/(1 + Q)$. (For 
example, if $Q = 1$, the fraction of analyte 
molecules bound by labelled antibody = 0.5 
approximately.) Thus, since the number of 
analyte molecules bound to the spot is given by 
$vCDAK/(v \times 6 \times 10^{20} + DAK)$, the number of 
analyte molecules labelled by the second, de- 
veloping, antibody is given by $vCDAKQ/[(v \times 6 
\times 10^{20} + DAK)(1 + Q)]$, and the surface density 
of such molecules is given by $vCDKQ/[(v \times 6 \times 
10^{20} + DAK)(1 + Q)]$. Moreover, assuming that 
$DAK \ll v \times 6 \times 10^{20}$ (i.e. that the amount of 
antibody in the system is such that 'ambient assay' 
conditions prevail), then the surface density ($D^*$) 
of developing-antibody molecules $= CDKQ/[(6 
\times 10^{20})(1 + Q)]$ approximately. It should be 
noted that $D^*$ is independent of both $v$ and $V$, 
also that the ratio $D^*/D = C \times KQ/[(6 \times 10^{20})(1 
+ Q)] = C \times$ constant. 

If the minimum detectable surface density of 
developing-antibody molecules (i.e. $\sigma_{D^*_0}$, the 
standard deviation of the measurement of $D^*$ 
at zero dose) is denoted $D^*_{\min}$, 
and $C_{\min}$ represents the minimum detectable 
analyte concentration in the test sample, then, 
disregarding non-specific binding of developing 
antibody within the microspot area, 

$$C_{\min} = D^*_{\min} \times \frac{(6 \times 10^{20})(1 + Q)}{DKQ} \tag{4}$$ 

For example, if $Q = 1$, $D = 10^{5}$ molecules/µm$^2$, $K 
= 10^{11}$ L/M and $D^*_{\min} = 20$ molecules/µm$^2$, then 
$C_{\min} = 2.4 \times 10^{6}$ molecules/ml $= 4 \times 10^{-15}$ M/L. It 
should be noted that, in this example, the fractional 
occupancy of the sensing antibody binding sites 
by the minimum detectable analyte concentration 
is 0.04%. 
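
The arithmetic of this sandwich-assay example can be reproduced with a few lines of Python. This is a sketch under the text's own assumptions (ambient conditions, no non-specific binding); the function names are illustrative.

```python
MOLECULES_PER_MMOL = 6e20  # Avogadro's constant per mmol, as rounded in the text


def c_min_sandwich(D_star_min, D, K, Q):
    """Eq. (4): minimum detectable analyte concentration in molecules/ml.

    D_star_min and D in molecules/um^2, K in L/M, Q in units of 1/K*.
    """
    return D_star_min * MOLECULES_PER_MMOL * (1.0 + Q) / (D * K * Q)


def molecules_per_ml_to_molar(c):
    """molecules/ml -> mol/L; 6e20 molecules per mmol equals 6e23 per mole."""
    return c * 1e3 / (MOLECULES_PER_MMOL * 1e3)


if __name__ == "__main__":
    c_min = c_min_sandwich(D_star_min=20.0, D=1e5, K=1e11, Q=1.0)
    molar = molecules_per_ml_to_molar(c_min)
    p = molar * 1e11              # analyte concentration in units of 1/K
    occupancy = p / (1.0 + p)     # ambient-analyte limit of Eq. (1)
    print(c_min, molar, occupancy)
```

Because Eq. (4) is linear in the minimum detectable surface density, an instrument with a ten-fold lower detection limit would, on these assumptions, yield a ten-fold lower minimum detectable concentration.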

Case 2: anti-idiotypic antibody ('competitive') 
assay. In this case, we assume that, following 
removal of the sample, the microspot is exposed 
to a volume V(ml) of a solution of (for example) a 
second, labelled, anti-idiotypic antibody reacting 
with unoccupied sites on the sensing antibody. 
Using similar reasoning as above, we may 
likewise assume that the fraction of such sites 
which become occupied by the anti-idiotypic 
'developing' antibody is given by $Q/(1 + Q)$, 
where $Q$ is the developing-antibody concentra- 
tion. However, the minimum detectable surface 
density of anti-idiotypic antibody is not, in a 
competitive design, the critical determinant of 
assay sensitivity; this parameter is essentially 
governed by the precision of the density measure- 
ment. 

From Eq. (1), the fraction of sites unoccupied 
by analyte $= 1/(1 + p)$, and the fraction occupied 
by anti-idiotypic antibody $= Q/[(1 + p)(1 + Q)]$. 
Thus, if the CV in the measurement of anti- 
idiotypic antibody is $\varepsilon$, the standard deviation is 
$\varepsilon Q/[(1 + p)(1 + Q)]$. This term also represents the 
SD in the estimate of the fraction of sites occupied 
by analyte. Since the total number of antibody 
binding sites in the spot is $DA$, the SD in the 
estimate of occupied sites as $p \to 0$ (i.e. $\sigma_{D_0}$) 
approximates $\varepsilon DAQ/(1 + Q)$; the SD in the 
occupied site surface-density estimate is thus 
$\varepsilon DQ/(1 + Q)$. But the SD in the measurement of 
fractional binding-site occupancy when $p \to 0$ 
defines $D_{\min}$, and hence the minimum detectable 
analyte concentration in the test sample as 
indicated in Eq. (4). 

Thus 

$$C_{\min} = D_{\min} \times \frac{(6 \times 10^{20})(1 + Q)}{DKQ} \tag{5}$$ 

$$= \frac{\varepsilon DQ}{1 + Q} \times \frac{(6 \times 10^{20})(1 + Q)}{DKQ} \tag{6}$$ 



For example, if values of $Q = 1$, $D = 10^{5}$ 
molecules/µm$^2$, and $K = 10^{11}$ L/M are assumed as 
in the non-competitive example considered 
above, and the CV in the measurement of 
anti-idiotypic antibody density in the microspot is 
1% (i.e. $\varepsilon = 0.01$), then $D_{\min} = 500$ molecules/ 
µm$^2$, and $C_{\min} = 6 \times 10^{7}$ molecules/ml $= 
10^{-13}$ M/L. Fractional occupancy of the sensing 
antibody binding sites by the minimum detectable 
analyte concentration is, in this example, 1%. It 
should be noted that the sensitivity limit of $\varepsilon/K$ 
(expressed in molar terms) is identical to that 
previously established for conventional 'competi- 
tive' assays (Ekins and Newman, 1970), and 
which underlies the predictions represented in 
Fig. 4. 
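
The competitive-design example can be checked in the same way. The sketch below evaluates Eqs (5) and (6) and confirms that the result reduces to the CV divided by K when expressed in molar terms; the function names are illustrative and not from the original.

```python
MOLECULES_PER_MMOL = 6e20  # Avogadro's constant per mmol, as rounded in the text


def d_min_competitive(eps, D, Q):
    """SD of the occupied-site surface density as p -> 0: eps*D*Q/(1 + Q)."""
    return eps * D * Q / (1.0 + Q)


def c_min_competitive(eps, D, K, Q):
    """Eqs (5)-(6): minimum detectable concentration in molecules/ml."""
    d_min = d_min_competitive(eps, D, Q)
    return d_min * MOLECULES_PER_MMOL * (1.0 + Q) / (D * K * Q)


if __name__ == "__main__":
    eps, D, K, Q = 0.01, 1e5, 1e11, 1.0
    c_min = c_min_competitive(eps, D, K, Q)
    molar = c_min / MOLECULES_PER_MMOL   # molecules/ml -> mol/L
    print(d_min_competitive(eps, D, Q), c_min, molar, eps / K)
```

D and Q cancel in Eq. (6), so on these assumptions the competitive limit depends only on the measurement CV and the antibody affinity, whereas the non-competitive limit of Eq. (4) falls as the detectable surface density is reduced.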

Such considerations appear to suggest (a) that 
microspot assay sensitivities superior to those 
obtainable by conventional radioisotopically 
based immunoassays are achievable, and (b) that 
sensitivities yielded by non-competitive microspot 
assays are likely to be considerably greater than 
those of corresponding competitive microspot 
assays. It must be emphasized, however, that, 
though such predictions are likely to prove 
correct, assumptions regarding the performance 
of the labels and signal-measuring instrument 
used are incorporated in the simple theoretical 
analysis discussed above. Such factors are clearly 
of importance in determining overall microspot 
immunoassay performance. 



Practical implementation 

The concepts discussed above are clearly exploit- 
able using a variety of antibody labels, including 
chemiluminescent labels; however, our prelimin- 
ary studies have been based on the use of 
conventional fluorophores, since the technology 
of simultaneous measurement of dual fluoresc- 
ence from small areas is already well established. 
Because this volume centres on chemiluminesc- 
ence, we shall provide only a brief indication of 
our initial experimental work in this area, which is 
currently based on the use of commercially 
available confocal microscopes. 

Instrumentation: the laser scanning confocal 
microscope. In confocal fluorescence 
microscopy, a small area of the specimen is 
illuminated by a focused laser beam; the fluoresc- 
ence photons emanating solely from this area are, 
in turn, focused onto a photon detector. Both the 
intensity of illumination and the efficiency of light 
collection diminish rapidly with distance from the 
focal plane (Fig. 12). At the 'confocal' point, the 
projection of the illumination pinhole and the 
back-projection of the detector pinhole coincide. 
Such systems contrast with conventional epi- 
fluorescence methods, where the specimen is 
exposed to an essentially uniform flux of illumina- 
tion (White et al., 1987). 

Figure 12. Principle of the confocal microscope. Illuminating 
light is focused at a point in the focal plane. Reflected light 
from this point is focused onto a detector. A complete 
two-dimensional image of structures within the focal plane is 
built up by scanning, and can 
be stored in a microcomputer for video display. 

Sensitivity of current instruments. Typically, 
fluorescence photons emanating from the 
laser-illuminated area are detected by a low dark 
current photomultiplier. Electrons spontaneously 
emitted by the photomultiplier photocathode 
contribute to the background signal of the 
instrument, and must, for highest sensitivity, be 
minimized. Fortunately the overall design of such 
instruments permits the photomultiplier photo- 
cathode to be of very small area, so that this 
particular source of background noise is not only 
small, but can be expected to reduce in relative 
importance with future improvement in photo- 
multiplier design. Meanwhile current instruments 
already display very high sensitivity of detection 
of fluorescent signals. For example, the confocal 
microscope manufactured by Zeiss is claimed to 
display a lower detection limit for fluorescein of 
about ten molecules/µm$^2$ (Ploem, 1986). Most 
commercially available FITC-labelled IgG attains 
a fluorophore/protein molar ratio of ~4; thus the 
detection limit ($D_{\min}$) of the Zeiss microscope is 
~2-3 FITC-labelled IgG molecules/µm$^2$. This 
implies an analyte-concentration detection limit 
of ~$2.4 \times 10^{5}$ molecules/ml for a two-site assay 
assuming the same parameter values as used in 
the examples discussed above, or $2.4 \times 10^{4}$ 
molecules/ml using a sensing antibody of affinity 
$10^{12}$ L/M. 
Another comparable instrument is the Bio-
Rad/Lasersharp laser scanning confocal micro-
scope, which we are currently using in the
development of 'ratiometric' multi-analyte assay
methodology in accordance with the principles
outlined above (see Fig. 13). The argon laser in
this system possesses two excitation lines at 488 
and 514 nm. It is thus particularly efficient for the 
excitation of blue/green emitting fluorophores 
such as FITC (which displays an excitation 
maximum at 492 nm). However, it is considerably 
less efficient in the excitation of red-emitting 
fluorophores such as Texas red (excitation max-
imum 596 nm). Nevertheless, the ratiometric im-
munoassay principle permits considerable varia-
tion in detection efficiencies of the two labels
relied on since, inter alia, the specific activities of 
the two labelled antibody species forming the 
antibody couplets can be chosen to yield optimal 
signal ratios in the region of unity. Thus 
inefficiency of the argon laser in exciting red 
emitting fluorophores is not necessarily a major 
handicap in the present context. 

Although it relies on a conventional microscope rather than a
purpose-designed optical system (and appears to

Figure 13. Dual-channel confocal fluorescence microscope
permitting simultaneous measurement of the fluorescence
signals from two fluorophores situated at the focal point. By
scanning the antibody array, the ratio of signals from each
antibody microspot may be determined. (Figure label: antibody microspot.)

be less sensitive), it permits quantification of 
fluorescence signals generated from microspots of 
selected area. Initial studies have revealed that, 
under conditions that are not necessarily optimal, 
the instrument is capable of detecting approx-
imately twenty-five FITC-labelled IgG molecules/
µm², scanning an area of ~50 µm² (Fig. 14). It
must be stressed that neither of these confocal
microscopes is designed specifically for routine
ratiometric multi-analyte immunoassay use, and 
it can be anticipated that future instruments 
constructed specifically for this purpose are likely 
to prove both cheaper and more sensitive. 
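
The conversion from a fluorophore detection limit to a labelled-IgG detection limit quoted above is simple arithmetic. The following is a minimal sketch only: Python is used purely for illustration, the function name is hypothetical, and the labelling ratio is taken as exactly 4 although the text gives it only as approximately 4.

```python
def igg_detection_limit(fluorophores_per_um2: float, fluorophores_per_igg: float) -> float:
    """Convert a fluorophore surface-density limit (molecules/µm²)
    into an equivalent limit for labelled IgG (molecules/µm²)."""
    if fluorophores_per_igg <= 0:
        raise ValueError("fluorophores_per_igg must be positive")
    return fluorophores_per_um2 / fluorophores_per_igg


# Zeiss figure quoted in the text: about ten fluorescein molecules/µm²,
# with an ASSUMED FITC/IgG molar ratio of exactly 4.
zeiss_igg_limit = igg_detection_limit(10.0, 4.0)

# Bio-Rad/Lasersharp figure quoted in the text (already expressed per IgG).
biorad_igg_limit = 25.0

# How many times higher the unoptimized Bio-Rad limit is than the Zeiss one.
limit_ratio = biorad_igg_limit / zeiss_igg_limit
```

Under these assumptions the Zeiss limit works out at 2.5 labelled IgG molecules/µm², consistent with the 2-3 quoted, and the Bio-Rad value obtained under non-optimal conditions is ten times higher.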

Other instruments. The MPM 200 Microscope
Photometer manufactured by Zeiss of West
Germany is anticipated to become available
shortly. This photometer is claimed to be highly 
versatile: it can be used in transmission and 
reflection modes, and as a highly sensitive 
fluorimeter. The measuring field can be varied in 
shape and size for optimum adjustment to the 
specimen structure. More generally, the technol- 
ogy of sensitive light measurement is improving 
rapidly in response to needs in astronomy, the 
space program etc., such technology clearly being 
readily exploitable in a multi-analyte immuno- 
assay context using light-generating labels in 
accordance with the broad principles presented 
here. 

Solid antibody supports. On the basis of the 
theoretical considerations discussed above, it is 
evident that solid antibody supports for multi- 
analyte immunoassay use should display a capac- 
ity to adsorb a high surface density of antibody 
combined with low intrinsic signal-generating 
properties (for example, low intrinsic fluoresc- 
ence), thus minimizing background. We have 
examined a number of materials, including 
polypropylene, Teflon, cellulose and nitrocellu- 
lose membranes and microtitre plates (clear 
polystyrene plates from Nunc; black, white and 
clear polystyrene plates from Dynatech) with
these criteria in mind. White Dynatech Micro-
fluor microtitre plates, formulated specially for
the detection of low fluorescence signals, yield 
high signal-to-noise ratios and have therefore 
been provisionally used in our developmental 
studies. 

Surface density of antibody coating. Preliminary 
experiments using Microfluor plates have re- 
vealed that it is possible to coat them with 
antibody at a surface density of at least 5 x 10⁴
IgG molecules/µm² (Fig. 15). Moreover nearly all
antibody molecules so deposited appear to retain 
immunological activity (Fig. 16). 

Verification of the 'ratiometric' immunoassay con-
cept. Our primary intention, in initial studies, has
been establishment of the basic conditions which,
using a particular instrument, can be anticipated
on theoretical grounds to yield high assay
sensitivity. Though the setting up of individual
microspot immunoassays has thus appeared to us
to be of secondary importance during the initial
stages of our studies, we have nevertheless

Figure 14. Fluorescence signal (arbitrary units), measured using the BioRad/Lasersharp scanning confocal microscope [remainder of caption missing in source]

Figure 16. Surface density of immunoreactive IgG molecules (number of molecules/µm²) plotted as a function of the total
surface density of IgG (number of molecules/µm²) on Dynatech Microfluor white microtitre plates



thought it useful to confirm the validity of our 
general concepts by comparing the performance 
of certain assays when constructed in microspot 
format and when conventionally designed. For 
example, we have compared a dual-labelled 
tumour necrosis factor (TNF) ratiometric assay 
system using Texas red and FITC-labelled anti- 
bodies with an optimized IRMA system using 
identical antibodies but with the second antibody 
¹²⁵I-labelled. Although unoptimized, the
ratiometric microspot assay yielded formal sensi-
tivity values closely approaching that of the
conventional, optimized, IRMA. Although these
results verify the general concepts underlying
ratiometric microspot immunoassay methodolo-
gy, further work is required to achieve the
considerably greater sensitivity that theory pre-
dicts as achievable using optimized reagent
concentrations and improved instrumentation.



CONCLUSION 

As indicated above, differentiation of the fluores- 
cent signals yielded by two fluorophores can be 
readily achieved solely on the basis of wavelength 
differences, and this approach has been relied on 
entirely in our preliminary studies. However, 



other physical techniques exploiting differences in 
decay time of two or more fluorescence emissions 
(using, for example, a pulsed or sinusoidally 
modulated laser source, and time- or phase-
resolving detectors) are available, and can be
expected both to further reduce background and 
to improve signal resolution, thus increasing assay 
sensitivity and precision. These considerations 
aside, the basic technology involved closely 
resembles that employed in domestic compact 
disk recorders and other similar data-storage 
devices, the obvious difference being that light 
emitted from each of the discrete zones forming 
the antibody-array is fluorescent rather than 
reflected, and yields chemical rather than physical 
information. Indeed, our preliminary studies 
suggest that highly sensitive immunoassays using 
antibody microspots of surface area approximat-
ing 50 µm² are achievable, implying that some
2,000,000 different immunoassays could, in prin-
ciple, be accommodated on a surface area of
1 cm². Though non-specific binding of a multiplic-
ity of developing antibodies would probably
prohibit the use of antibody arrays of this order, it 
is evident that the technology is capable of 
encompassing analyte numbers of the kind likely 
to be useful in practice. 
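
The packing estimate in this paragraph can be checked with a short calculation. This is a sketch under stated assumptions: the microspot area is taken as 50 µm² (the value consistent with the quoted figure of 2,000,000 assays per cm²), the spots are assumed to tile the surface with no gaps, and the function name is hypothetical.

```python
UM2_PER_CM2 = 10_000 ** 2  # 1 cm = 10,000 µm, so 1 cm² = 1e8 µm²


def max_spots_per_cm2(spot_area_um2: float) -> int:
    """Upper bound on the number of microspots per cm², assuming the spots
    tile the surface with no gaps between them (an idealization)."""
    if spot_area_um2 <= 0:
        raise ValueError("spot_area_um2 must be positive")
    return int(UM2_PER_CM2 // spot_area_um2)


# Microspot area assumed to be 50 µm², as discussed in the lead-in.
spots = max_spots_per_cm2(50.0)
```

Any real array would need spacing between spots, so the figure is an upper bound rather than a design target.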

The development of multi-analyte assay sys-
tems of this kind can be anticipated to bring about
fundamental changes in medical diagnosis and
many other biologically related areas. Systems 
capable of measuring every hormone and other 
endocrinologically related substance within a 
single small sample of blood are within technolo- 
gical reach, providing data which, when analysed 
with the aid of computer-based 'expert' pattern-
recognition systems, are likely to reveal endoc-
rine deficiencies only dimly perceived using
current 'single-analyte' diagnostic procedures. 
Such systems also provide a means to the 
development of a 'random access' immunoassay 
methodology, permitting the selection of any 
desired test or combination of tests from an 
extensive analyte menu. Clearly the accommoda- 
tion of a wide range of individual immunoassays 
on a small immunoprobe (comparable in its 
overall physical dimensions with a few drops of 
blood) is likely to totally transform the logistics of 
immunodiagnostic testing, and genuinely repre- 
sents, in our view, 'next generation' immunoassay 
methodology.

Multianalyte Microspot Immunoassay: Microanalytical "Compact Disk" of the Future

Throughout the 1970s, controversy centered both on im-
munoassay "sensitivity" per se and on the relative sensi-
tivities of labeled antibody (Ab) and labeled analyte meth-
ods. Our theoretical studies revealed that RIA sensitivities
could be surpassed only by the use of very high-specific- 
activity nonisotopic labels in "noncompetitive" designs, 
preferably with monoclonal antibodies. The time-resolved 
fluorescence methodology known as DELFIA (developed in
collaboration with LKB/Wallac) represented the first com-
mercial "ultrasensitive" nonisotopic technique based on
these theoretical insights, the same concepts being sub- 
sequently adopted in comparable methodologies relying 
on the use of chemiluminescent and enzyme labels. How- 
ever, high-specific-activity labels also permit the develop- 
ment of "multianalyte" immunoassay systems combining 
ultrasensitivity with the simultaneous measurement of tens, 
hundreds, or thousands of analytes in a small biological
sample. This possibility relies on simple, albeit hitherto- 
unexploited, physicochemical concepts. The first is that all 
immunoassays rely on the measurement of Ab occupancy 
by analyte. The second is that, provided the Ab concentra-
tion used is "vanishingly small," fractional Ab occupancy is
independent of both Ab concentration and sample volume. 
This leads to the notion of "ratiometric" immunoassay, 
involving measurement of the ratio of signals (e.g., fluores-
cent signals) emitted by two labeled Abs, the first (a
"sensor" Ab) deposited as a microspot on a solid support, 
the second (a "developing" Ab) directed against either 
occupied or unoccupied binding sites of the sensor Ab. Our 
preliminary studies of this approach have relied on a 
dual-channel scanning-laser confocal microscope, permit-
ting microspots of area 100 µm² or less to be analyzed,
and implying that an array of 10⁶ Ab-containing microspots,
each directed against a different analyte, could, in princi-
ple, be accommodated on an area of 1 cm². Although
measurement of such analyte numbers is unlikely ever to 
be required, the ability to analyze biological fluids for a wide 
spectrum of analytes is likely to transform immunodiagnos- 
tics in the next decade. 

Additional Keyphrases: ratiometric immunoassays • scanning-
laser confocal microscope • fluoroimmunoassay

Immunoassay and other protein-binding assay meth- 
ods based on the use of radioisotopic labels have played 
a major role in medicine during the past three decades.
Their utility and importance have derived primarily
from the structural specificity of many reactions be-
tween binding proteins and analytes and the detectabil-
ity of isotopically labeled reagents, the latter endowing
such techniques with "exquisite sensitivity." Recently,
however, interest has increasingly focused on noniso-
topic techniques based on identical analytical princi-
ples, differing only in the nature of the marker used to
label the reactant (e.g., antibody or antigen), whose
distribution between reacted ("bound") and unreacted
("free") fractions constitutes the assay "response."

The basic aims underlying this interest can be 
broadly classed under four main headings: 

• avoidance of the environmental, legal, economic, and 
practical disadvantages of isotopic techniques (e.g., lim- 
ited shelf life of isotopically labeled reagents, problems 
of radioactive waste disposal, cost and complexity of 
radioisotope counting equipment), particularly those 
impeding the development of, for example, simple diag-
nostic kits for home or doctor's office use;

• achievement of greater assay sensitivity; 

• "direct" measurement of analyte concentrations by 
use of transducer-based "immunosensors"; 

• simultaneous measurement of multiple analytes
("multianalyte assay").

In this presentation I will focus primarily on the last 
of these objectives, using this to set out the principles 
underlying our present attempts to develop a new "min-
iaturized" technology that will permit the simultaneous
measurement of an unlimited number of analytes in a
small biological sample such as a single drop of blood.
However, retention (and, if possible, improvement) of
the high sensitivities of conventional isotopic tech-
niques is a basic aim not only of our own studies in this
area but also of most other endeavors falling under the 
above headings. It is therefore appropriate to preface 
this paper with a discussion of the general principles 
underlying the attainment of high binding-assay sensi- 
tivity. 

Immunoassay Sensitivity: Some Basic Concepts 

Definition of Assay Sensitivity 

The need to establish assay conditions yielding max- 
imal sensitivity underlay the independent construction 
of mathematical theories of immunoassay design by 
both Yalow and Berson (1) and Ekins et al. (2) in the 
course of the original development of these methods in 
the early 1960s. Regrettably, these theoretical studies
led to a prolonged controversy, arising largely from the
conflicting concepts of "sensitivity" adopted by the two
groups (see Figure 1). Briefly, Berson and Yalow, in
their many publications relating to immunoassay de-
sign (e.g., 2, 3), defined sensitivity as the slope of the

Fig. 1. The differing concepts of sensitivity and precision underlying
radioimmunoassay design theories developed by (left) Yalow and
Berson (e.g., 1, 3) and (right) Ekins et al. (2, 4)
Yalow and Berson define assay A as more sensitive because it yields a
response curve of greater slope. Ekins et al. define assay B as more sensitive
because the imprecision of measurement of zero dose ($\sigma_0$) is less. Yalow and
Berson likewise define an assay system as more precise if it yields a steeper
response curve when data are plotted on a log dose scale.

response curve relating the fraction or percentage of 
labeled antigen bound (b) to analyte concentration ([H]). 
In contrast, Ekins et al. (e.g., 2,4) defined sensitivity as 
the (im)precision of measurement of zero dose, this
quantity being indicative of, and essentially equivalent 
to, the lower limit of detection. 

The key difference between these two definitions 
clearly lies in the dependence of the assay detection 
limit on the error (imprecision) in the measurement of 
the response variable. By neglecting this crucial factor, 
the "response curve slope" definition leads to many 
obvious absurdities. For example, plotting conventional 
RIA data in terms of the response metameter B/F (i.e., 
the bound to free ratio) suggests that assay "sensitivity" 
is increased by increasing the antibody concentration in 
the system; however, the converse conclusion is reached 
if identical data are plotted in terms of F/B (see Figure
2). Observation of the shape and slopes of response
curves without detailed error analysis thus constitutes a
totally misleading guide to optimal immunoassay de-
sign. This approach has, however, characterized many
of the studies conducted in the immunoassay field dur-
ing the past 30 years, and has been the source of much

Fig. 2. Schematic representation of RIA dose-response curves
observed for high and low antibody concentrations plotted in terms of
(left) the free/bound fraction (F/B); (center) the bound/free fraction
(B/F)
Note that the low antibody concentration yields a response curve of greater
slope when the assay response is plotted in terms of F/B, but of lower slope
when plotted in terms of B/F. The precision of measurement of zero dose
($\Delta D_0$) is independent of the coordinate frame used to plot assay data (see
right)



mythology. For example, consideration of the Law of
Mass Action reveals that, when response curves corre-
sponding to different antibody concentrations are plot-
ted in terms of b vs [H], the maximal slope at zero dose
is obtained for a concentration of 0.5/K (where K is the
affinity constant), in which circumstance the zero dose
response ($b_0$) is 33%. This conclusion led to Berson and
Yalow's enunciation of the well-known dictum (which,
albeit erroneous, is broadly adhered to by many immu-
noassay practitioners and kit manufacturers) that, to
maximize RIA sensitivity, the amount of antibody to use
in the system is that which binds 33% of labeled antigen
in the absence of unlabeled antigen (1, 3).
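
The origin of the 33% figure can be reproduced numerically from the Law of Mass Action. The sketch below is illustrative only: it assumes a single class of binding sites, a trace amount of labeled antigen (so that b is simply the bound fraction of total antigen), an arbitrary affinity of 10¹⁰ l/mol, and hypothetical function names. It locates the antibody concentration that gives the steepest zero-dose slope of b against [H], which is the quantity the dictum maximizes, not the detection limit.

```python
import math


def bound_fraction(ab_total: float, antigen_total: float, k: float) -> float:
    """Equilibrium fraction of antigen bound (b) for one class of binding sites.

    Solves B = K (q - B)(H - B), i.e. B^2 - (q + H + 1/K) B + q H = 0,
    taking the smaller root in a form that avoids subtractive cancellation.
    """
    s = ab_total + antigen_total + 1.0 / k
    bound = 2.0 * ab_total * antigen_total / (
        s + math.sqrt(s * s - 4.0 * ab_total * antigen_total)
    )
    return bound / antigen_total


def zero_dose_response(ab_total: float, k: float) -> float:
    """b0: fraction of (trace) labeled antigen bound with no unlabeled antigen."""
    return k * ab_total / (1.0 + k * ab_total)


def initial_slope(ab_total: float, k: float) -> float:
    """Finite-difference estimate of db/d[H] as [H] tends to zero."""
    h = 1e-6 / k  # a dose far below 1/K
    return (bound_fraction(ab_total, h, k) - zero_dose_response(ab_total, k)) / h


K = 1e10  # l/mol, an arbitrary illustrative affinity (assumption)
grid = [(i / 100.0) / K for i in range(1, 301)]  # [Ab] from 0.01/K to 3/K
steepest_ab = max(grid, key=lambda q: abs(initial_slope(q, K)))
b0_at_steepest = zero_dose_response(steepest_ab, K)
```

On this grid the steepest slope falls at an antibody concentration of 0.5/K, where one third of the labeled antigen is bound at zero dose. The calculation contains no error term, which is exactly why the slope criterion says nothing about the assay detection limit.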

Disagreement regarding the concept of sensitivity 
inevitably led to prolonged dispute regarding immu- 
noassay design (5). However, although it is still common 
to encounter publications in the field that rely solely on 
the response curve slope as a measure of sensitivity, the 
assay detection limit is now widely accepted as the only 
valid indicator of this parameter, and we do not there- 
fore intend to dwell further on this issue here. It is 
nevertheless relevant to an understanding of the "min- 
iaturized" assay methodology described below to empha- 
size that untenable concepts of both sensitivity and 
precision underlie many of the commonly accepted rules 
governing current immunoassay-design practice, some 
of which are contravened in our own approach. 

Basic Immunoassay Designs 

It is likewise important in the present context to 
comprehend the basis of the various types of immunoas- 
says currently in use, and the constraints on the sensi- 
tivities of which they are potentially capable. The radio- 
immunoassay and analogous protein-binding assay 
techniques originally developed for the measurement of 
insulin by Yalow and Berson (6), and of thyroxin and 
vitamin B₁₂ by Ekins and Barakat (7, 8), relied on the
use of a labeled analyte marker to reveal the products of 
the binding reactions between analyte and binder (Fig- 
ure 3, left). This approach has subsequently often been 
portrayed as relying on "competition" between labeled 
and unlabeled analyte molecules for a limited number of 
protein-binding sites, such assays being frequently re-
ferred to as "competitive."

Subsequently, Wide et al. in Sweden (9), followed 
shortly by Miles and Hales in the U.K. (10), developed
labeled antibody methods (Figure 3, right). These meth-
ods represented an extension of the "labeled reagent"
methods (utilizing radiolabeled organic compounds such
as ¹³¹I-labeled p-iodosulfonyl chloride, [³H]acetic anhy-
dride, and other similar reagents) devised, during the
early 1950s, by Keston et al. (11), Avivi et al. (12), and
others for quantifying amino acids, steroid and thyroid 
hormones, etc. Although radiolabeled antibody methods 
(immunoradiometric assays; IRMAs) were originally 
claimed (13) to be more sensitive than methods based on 
the use of radiolabeled analyte, these claims were sup-
ported by neither rigorous theoretical analysis nor per-
suasive experimental evidence, and for some time re-
mained controversial. Further doubt on their validity

Fig. 3. Labeled-analyte (left) and labeled-antibody (right) assay
systems compared
Labeled-analyte assay systems essentially rely on observation of an analyte
"marker" to reveal the products of the reaction between analyte and antibody
(although the labeled analyte is not necessarily identical to the unlabeled
analyte in its binding characteristics vis-à-vis antibody). Note that, irrespective
of which fraction of the labeled analyte is measured after the binding reaction,
the optimal antibody concentration required to maximize sensitivity in such a
system tends toward zero (assuming a background signal of 0). Labeled-
antibody systems rely on observation of an antibody "marker" to reveal the
products of the binding reaction between analyte and antibody. In this case
the optimal antibody concentration required to maximize sensitivity tends
toward zero when the "free" antibody fraction is measured, but tends toward
infinity when the bound fraction is determined (likewise assuming zero
background).



was cast by the publication by Rodbard and Weiss in 
1973 (14) of detailed theoretical studies demonstrating
that both labeled analyte and labeled antibody methods 
possessed essentially equal sensitivities. (Note: These 
authors suggested that IRMAs might be more sensitive in 
the assay of small polypeptides, in which radioiodine 
incorporation into the antigen molecule was restricted; 
conversely, these assays would be less sensitive for the 
measurement of antigens of high molecular mass.) Nev- 
ertheless, despite the appearance of this publication, the 
belief that labeled antibody methods per se are intrin- 
sically more sensitive than the corresponding labeled 
analyte methods gained wide acceptance among clinical 
chemists. 

The reason for confusion on this issue is that the 
greater potential sensitivity of certain assay formats is 
not really a consequence of the labeling of antibody as 
opposed to analyte; indeed, the apparent antithesis 
between labeled-analyte and labeled-antibody methods 
diverts attention from the true reasons underlying the 
superior sensitivity of certain assay designs. Theoretical 
analysis (see, e.g., 4, 15) reveals that, assuming "per-
fect" separation of the products of the binding reaction
(i.e., no misclassification of bound and free moieties), the
optimal antibody concentration (for maximal sensitiv-
ity) in a labeled analyte immunoassay invariably tends
to zero, irrespective of whether the free or bound labeled
analyte fraction is measured, whereas in labeled-anti-
body methods the optimal antibody concentration de-
pends on which labeled-antibody fraction is measured
(see Figure 3). If the free (unreacted) antibody fraction is
measured, the optimal concentration also tends to zero;
conversely, if the analyte-bound fraction is measured,
the concentration tends to infinity. In short, of the four
basic measurement strategies available (labeled ana-
lyte, with measurement of free or bound reaction prod-
uct, and labeled antibody, also with measurement of
free or bound product), only one permits, in practice, the
use of antibody concentrations approaching infinity.



This particular approach may, for want of a better term, 
be described as "noncompetitive," although it must be
emphasized that such terminology involves a departure
from the original meanings attached to "competitive"
and "noncompetitive" when these descriptions were first
used in the present context. Indeed, as discussed below,
assays may be subclassified in this manner when no 
labeled reagent of any kind is involved. 

However, the categorization of immunoassays and 
other binding assays as competitive or noncompetitive, 
depending on the binding agent concentration yielding 
maximal assay sensitivity, itself obscures the underly-
ing reasons for the existence of this divergence in assay
designs, and may thus be misleading. These reasons 
may be more readily understood if the basic principles of 
such assays are portrayed differently from their custom- 
ary presentation. 

The "Antibody Occupancy Principle" of Immunoassay 

When a "sensor" antibody is introduced into an ana- 
lyte-containing medium, binding sites on the antibody 
are occupied by analyte molecules to a fractional extent 
that reflects both the equilibrium constant governing 
the binding reaction, and the final concentration of free 
analyte present in the mixture. This proposition stems 
immediately from the Law of Mass Action, which can be 
written as

$$\frac{[\mathrm{AbAg}]}{[\mathrm{fAb}]} = K\,[\mathrm{fAg}] \tag{1}$$



or as fractional occupancy of antibody binding sites, 
given by 



$$\frac{[\mathrm{AbAg}]}{[\mathrm{Ab}]} = \frac{K\,[\mathrm{fAg}]}{1 + K\,[\mathrm{fAg}]} \tag{2}$$



where [AbAg], [Ab], [fAb], and [fAg] represent the
concentrations (at equilibrium) of bound and total anti-
body, and free antibody and antigen (analyte), respec-
tively, and K = equilibrium constant. The final concen-
tration of free analyte generally depends on the concen-
trations of both total analyte and antibody; however,
when total antibody approximates 0.05/K or less, free
and total antigen ([Ag]) concentrations do not differ 
significantly, and fractional occupancy of antibody is 
given by 



$$\frac{[\mathrm{AbAg}]}{[\mathrm{Ab}]} = \frac{K\,[\mathrm{Ag}]}{1 + K\,[\mathrm{Ag}]} \tag{3}$$



Assays utilizing this concept have been termed "am-
bient analyte immunoassays" (16), fractional occupancy
being independent of both sample volume and antibody
concentration (see below).
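
How closely the ambient-analyte expression tracks the full mass-action result can be checked numerically. The following is a sketch under stated assumptions: a single class of binding sites, an arbitrary affinity of 10¹¹ l/mol, an analyte range of 0.001/K to 1000/K chosen only for illustration, and hypothetical function names.

```python
import math


def occupancy_exact(ab_total: float, ag_total: float, k: float) -> float:
    """Fractional occupancy [AbAg]/[Ab] from the full mass-action balance,
    allowing for depletion of free analyte by the antibody."""
    s = ab_total + ag_total + 1.0 / k
    bound = 2.0 * ab_total * ag_total / (
        s + math.sqrt(s * s - 4.0 * ab_total * ag_total)
    )
    return bound / ab_total


def occupancy_ambient(ag_total: float, k: float) -> float:
    """Occupancy when free and total analyte concentrations are taken as equal."""
    return k * ag_total / (1.0 + k * ag_total)


def worst_relative_deviation(ab_total: float, k: float) -> float:
    """Largest relative gap between the two expressions over analyte
    concentrations spanning 0.001/K to 1000/K."""
    worst = 0.0
    for exponent in range(-3, 4):
        ag = (10.0 ** exponent) / k
        approx = occupancy_ambient(ag, k)
        exact = occupancy_exact(ab_total, ag, k)
        worst = max(worst, abs(approx - exact) / approx)
    return worst


K = 1e11  # l/mol, illustrative only (assumption)
dilute = worst_relative_deviation(0.05 / K, K)        # antibody at 0.05/K
concentrated = worst_relative_deviation(10.0 / K, K)  # antibody at 10/K
```

With these illustrative numbers the approximation stays within 5% when total antibody is 0.05/K, whereas at 10/K the antibody depletes the analyte and the two expressions diverge widely.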

All immunoassays essentially depend on measure-
ment of the "fractional occupancy" of the sensor anti-
body after its reaction with analyte (see Figure 4).
Techniques relying on the measurement of unoccupied
antibody binding sites (from which antibody occupancy
is implicitly deduced by subtraction) necessitate, for
attainment of maximal sensitivity, the use of sensor
antibody concentrations tending to zero; these assays

Fig. 4. The antibody binding-site occupancy principle of immunoas-
say
All immunoassays implicitly rely on the measurement of (fractional) binding-
site occupancy by analyte



may therefore be categorized as "competitive." Con-
versely, techniques in which occupied sites are directly
measured permit (in principle) the use of relatively high
concentrations of sensor antibody and may be described
as "noncompetitive." This difference in assay design
simply reflects the proposition that, to minimize error in 
the measurement, it is generally undesirable to mea- 
sure a small quantity by estimating the difference 
between two large quantities. 
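
The point about differences between large quantities can be illustrated with a simple error-propagation sketch. It assumes, purely for illustration, that the total number of binding sites is known exactly and that whichever fraction is measured carries the same coefficient of variation; the function names are hypothetical.

```python
def relative_error_direct(cv: float) -> float:
    """Relative error in occupied sites when they are measured directly
    with coefficient of variation `cv`."""
    return cv


def relative_error_by_difference(occupancy: float, cv: float) -> float:
    """Relative error in occupied sites when they are inferred as
    (total sites - measured unoccupied sites).

    Assumes the total is known exactly and the unoccupied sites are measured
    with coefficient of variation `cv`; the absolute error cv * (1 - occupancy)
    is then divided by the small quantity of interest, `occupancy`.
    """
    if not 0.0 < occupancy < 1.0:
        raise ValueError("occupancy must lie strictly between 0 and 1")
    return cv * (1.0 - occupancy) / occupancy


# Illustrative case: 1% of sites occupied, 1% measurement CV.
direct = relative_error_direct(0.01)
indirect = relative_error_by_difference(0.01, 0.01)
```

In this hypothetical case the occupancy inferred by subtraction carries a relative error of about 99%, against 1% for direct measurement. This is why a design that measures unoccupied sites is pushed toward low antibody concentrations, where occupancy is no longer a small fraction of the total.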

These concepts are illustrated in Figure 5, which 
portrays basic immunoassay formats currently in com- 
mon use. Conventional RIA and other similar "labeled- 
analyte" techniques rely on measurement of unoccupied 
binding sites, generally by back-titration (either simul- 
taneous or sequential) with labeled analyte, but anti- 
idiotype antibody (reactive only with unoccupied sites 
on the sensor antibody) may be used for the same 
purpose. In the case of single-site labeled-antibody as- 
says, the labeled antibody itself constitutes the sensor 
antibody; after reaction with analyte, this sensor anti- 
body may be separated into occupied and unoccupied 
fractions through use of (e.g.) an immunosorbent (com-
prising antigen, antigen analog, or anti-idiotypic anti-
body linked to a solid support). If, after separation, the
"signal" emitted by labeled antibody bound to analyte
(i.e., the "occupied" fraction) is measured directly, the
assay can be classed as "noncompetitive." Conversely, if
one measures the labeled antibody not bound to analyte
(i.e., that attached to the immunosorbent), then the
assay is "competitive." 

Two-site "sandwich" assays are clearly more complex 
because they rely on two antibodies and can be consid- 
ered from two points of view. For our present purposes, 
the solid-phase antibody can be regarded as the "sensor"
antibody, with the labeled antibody enabling the occu- 
pied sensor-antibody binding sites to be distinguished. 
Seen from this viewpoint, two-site assays may be 
classed as "noncompetitive." 

These considerations emphasize that the differences
in design distinguishing so-called competitive and non-
competitive methods are essentially unrelated to which

Fig. 5. Basic competitive and noncompetitive immunoassay designs
The distinction between noncompetitive and competitive immunoassays re-
flects the way in which antibody binding-site occupancy is observed. Labeled-
antibody methods are "noncompetitive" if occupied sites of the (labeled)
antibody are directly measured, but are "competitive" (lower right) when
unoccupied sites are measured. Labeled-antigen (lower left) or labeled-anti-
idiotypic-antibody methods (lower center) rely on measurement of sites
unoccupied by analyte, and are therefore of "competitive" design

component (if any) of the reaction system is labeled.
Indeed, in the case of transducer-based "immunosen- 
sors," no component is labeled; nevertheless, the design 
of the immunosensor will differ significantly, depending 
on whether a measurable signal is yielded by occupied 
or unoccupied antibody binding sites situated on its 
surface. In short, the terms "competitive" and "noncom- 
petitive" merely reflect alternative approaches to the 
determination of the occupancy of antibody binding 
sites and lead to differences in the optimal antibody 
concentration required to minimize the effects of ran- 
dom errors arising in the determination. 

Competitive and noncompetitive immunoassays can 
be shown to differ significantly in many of their perfor- 
mance characteristics, including their sensitivities. In 
both types of assays, both the affinity constant (K) of the 
antibody and the specific activity of the label are impor- 
tant in determining sensitivity; however, in practice, 
the sensitivity of competitive assays is primarily limited 
by the affinity constant of the antibody, whereas the 
specific activity of the label is more important in non-
competitive systems. In both cases, the "experimental"
or "manipulation" error in the measurement of the
zero-dose response ($R_0$) [i.e., the relative error ($\sigma_{R_0}/R_0$)
arising from pipetting and other operations, but not
including the statistical signal measurement error per
se] is of key importance in determining "potential"
assay sensitivity (i.e., the sensitivity obtained by assum-
ing the specific activity of the label to be infinite,
implying zero error in signal measurement). Thus the
potential sensitivity of a competitive assay can be
shown to be $\sigma_{R_0}/(K R_0)$, whereas that of a noncompetitive
assay is given by $R_0\sigma_{R_0}/([\mathrm{Ab}] K R_0)$, where, in the latter
case, $R_0$ is assumed to represent the labeled antibody
misclassified as bound ([hAb]), commonly referred to as
"nonspecifically bound" antibody. Thus $R_0/[\mathrm{Ab}] = f$, the
fraction of labeled antibody that is nonspecifically
bound, and $R_0\sigma_{R_0}/([\mathrm{Ab}] K R_0) = f\sigma_{R_0}/(K R_0)$. Assuming that
the relative error ($\sigma_{R_0}/R_0$) in the measurement of the
zero-dose response is approximately identical for both
competitive and noncompetitive assays, it is evident
from this simple analysis that the potential sensitivity
of noncompetitive methods is greater than that of com-
petitive methods by the factor 1/f, i.e., by the reciprocal of the
fraction of labeled antibody that is "nonspecifically bound." For
example, if the nonspecifically bound fraction is 0.01%,
a noncompetitive strategy is potentially capable of a
sensitivity 10 000-fold greater than that of a competi-
tive approach, other factors being equal.
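
The two potential-sensitivity expressions can be evaluated for concrete values. The sketch below is illustrative only: the affinity of 10¹² l/mol and the 1% relative error are assumed inputs, the conversion to molecules/ml assumes K is expressed in l/mol, and the function names are hypothetical. "Sensitivity" here is a detection limit, so a smaller number is better.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def potential_sensitivity_competitive(k: float, cv: float) -> float:
    """sigma_R0 / (K * R0), in mol/l, with cv = sigma_R0 / R0 and K in l/mol."""
    return cv / k


def potential_sensitivity_noncompetitive(k: float, cv: float, f: float) -> float:
    """f * sigma_R0 / (K * R0), in mol/l; f is the nonspecifically bound fraction."""
    return f * cv / k


def to_molecules_per_ml(mol_per_litre: float) -> float:
    """Convert mol/l to molecules/ml."""
    return mol_per_litre * AVOGADRO / 1000.0


K = 1e12   # l/mol, illustrative affinity (assumption)
CV = 0.01  # 1% relative error in the zero-dose response (assumption)
F = 1e-4   # 0.01% of labeled antibody nonspecifically bound, as in the text

competitive = to_molecules_per_ml(potential_sensitivity_competitive(K, CV))
noncompetitive = to_molecules_per_ml(potential_sensitivity_noncompetitive(K, CV, F))
improvement = competitive / noncompetitive
```

The ratio of the two limits is 1/f, which for a nonspecifically bound fraction of 0.01% reproduces the 10 000-fold advantage quoted for the noncompetitive design, independent of the values assumed for K and the relative error.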

These findings are summarized in Figure 6 (left), which shows the
relationships between sensitivity (expressed in terms of molecules
per milliliter) and antibody affinity in an optimized competitive
(labeled analyte) assay. For this analysis, we assume (a) the use of
a label of infinite specific activity, and (b) the use of 125I as a
label, the radioactivity of the samples being counted for 1 min.
Computations of the theoretically optimal reagent concentrations (on
which calculations represented in Figure 6 rely) were based on the
further assumptions that (c) the radioactivity of the antibody-bound
labeled-analyte fraction was counted and (d) the (relative)
"experimental error" component in the measurement of the bound
fraction (σ_b/b) was 1%. Given these assumptions, the "potential"
sensitivity attainable in such an assay is σ_b/(bK), where K is the
affinity constant of the antibody. [For example, if the affinity
constant is 10^12 L/mol, and σ_b/b is 0.01 (1%), maximal assay
sensitivity is 10^-14 mol/L, or ~6 x 10^6 molecules/mL.] The
additional "signal measurement error" arising in consequence of
counting radioactive samples for a finite time implies a loss of
assay sensitivity, as shown by the upper curve in Figure 6 (left).
However, the resulting loss in sensitivity is relatively small for
antibodies of affinities <10^12 L/mol, and is negligible for
antibodies with affinities <10^11 L/mol. In other words, if the
assayist can accept individual sample counting times of 1-5 min,
little improvement in sensitivity is gained by using alternative
labels of higher specific activities than 125I. However, similar
considerations suggest that radioisotopic labels of much lower
specific activity than 125I (e.g., 3H) may limit the sensitivities of
the assays (such as steroid assays) in which they are used,
notwithstanding the use of relatively long sample counting times.

Fig. 6. Theoretically predicted sensitivities of competitive and
noncompetitive immunoassay methods (represented by the SD of zero
analyte measurements, expressed as molecules/mL) plotted as a
function of antibody affinity (K)
Note: in noncompetitive sandwich assays, the antibody affinity
referred to is that of the labeled antibody. In the competitive
assays, calculations are based on the assumption that the
experimental error (CV) incurred in the measurement of the assay
response (e.g., fraction of labeled antigen bound) is 1%. The
"potential sensitivity" curve assumes the use of a label of infinite
specific activity, implying that the error in the measurement of the
label per se is zero. The "125I" curve indicates the loss in
sensitivity incurred in counting radioactive disintegrations for a
finite time [...]. For noncompetitive assays, the potential
sensitivity [...] is improved by minimizing nonspecific binding. The
corresponding "125I" curves demonstrate the much greater loss in
sensitivity (compared with that potentially attainable) when a
radioisotopic marker is used, and the special advantages of
nonisotopic labels of higher specific activity [...]. These
conclusions underlay the original development (19, 20) of
time-resolved fluoroimmunoassay (DELFIA), the first nonisotopic
"ultrasensitive" immunoassay

The other main conclusions stemming from such analysis are the
importance of both minimizing "manipulation" errors and using
antibodies of high binding affinity. For example, an increase in
σ_b/b to 3% implies an approximate threefold loss in sensitivity,
notwithstanding the fact that an assay reoptimized in response to the
deterioration in operator skill that these numbers imply would
utilize less antibody and labeled analyte, thereby partially
offsetting the consequences of poor pipetting. But the most important
conclusion emerging from the analysis is the near impossibility, in
practice, of achieving immunoassay sensitivities better than about
10^7 molecules/mL by using a competitive approach, irrespective of
the nature of the label used, if one assumes an upper limit to
antibody binding affinities on the order of 10^12 L/mol.
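
The arithmetic behind these figures is a unit conversion from σ_b/(bK) in mol/L to molecules/mL. The sketch below is illustrative only: the function name is hypothetical, the affinity of 10^12 L/mol is the assumed upper limit discussed in the text, and the simple formula ignores the partial recovery obtained by reoptimizing the assay.

```python
import math

AVOGADRO = 6.022e23  # molecules per mole

def competitive_limit_molecules_per_ml(rel_error, K):
    """Potential sensitivity of a competitive assay, (sigma_b/b)/K,
    converted from mol/L to molecules/mL."""
    mol_per_litre = rel_error / K
    return mol_per_litre * AVOGADRO / 1000.0  # 1000 mL per litre

K_UPPER = 1e12  # assumed upper limit to antibody affinity, L/mol

limit_1pct = competitive_limit_molecules_per_ml(0.01, K_UPPER)
limit_3pct = competitive_limit_molecules_per_ml(0.03, K_UPPER)

print(f"1% error: {limit_1pct:.2e} molecules/mL")
print(f"3% error: {limit_3pct:.2e} molecules/mL")
print(f"loss factor: {limit_3pct / limit_1pct:.1f}")
```

With a 1% experimental error the limit is of the order of 10^7 molecules/mL, which is the practical floor the paragraph describes for competitive designs.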

The results of a similar analysis of the sensitivity limitations
applying to noncompetitive (two-site) assays (15) are illustrated in
Figure 6 (right). Two sets of curves are portrayed here,
corresponding to the assumptions of 1% and 0.01% nonspecific binding
of labeled antibody to the capture-antibody substrate. Such analysis
likewise yields important conclusions relevant to assay design, e.g.,
the crucial importance of reducing nonspecific binding of labeled
antibody to an absolute minimum. Furthermore, if nonspecific binding
is reduced to ~0.01%, just as high sensitivity is achievable by using
an antibody of K = 10^8 L/mol in an optimized noncompetitive assay
design as by using an antibody of K = 10^12 L/mol in a competitive
method. One of the most important conclusions is that the
sensitivities potentially attainable with high-affinity antibodies
(K >10^10 L/mol) are beyond the reach of radioisotopically based
methods, which (because of the relatively low specific activities of
isotopes such as 125I) are limited in practice to sensitivities of
the order of 10^6-10^7 molecules/mL or more. In short, although,
under certain circumstances, noncompetitive IRMAs may be somewhat
more sensitive than corresponding RIA techniques (assuming the use of
the same antibody in each methodology), the potential advantages
(vis-à-vis sensitivity) of the noncompetitive approach can be
realized only by using nonisotopic labels of much higher specific
activity than 125I. The
superiority of such labels is most apparent when they are combined
with high-affinity antibodies; however, the analysis also
demonstrates that, even with use of antibodies with affinities of
about 10^8-10^9 L/mol, nonisotopic labels may yield a substantial
improvement in sensitivity.

These theoretical conclusions, together with the publication by
Köhler and Milstein (18) of methods of in vitro production of
monoclonal antibodies (J), constituted the basis of my laboratory's
collaborative development (initiated around 1976) with the instrument
manufacturer LKB/Wallac of the time-resolved fluorometric immunoassay
methodology now known as DELFIA (19, 20). This methodology was the
first "ultra-sensitive" nonisotopic immunoassay methodology to be
developed. The same basic approach has subsequently been adopted by
many other manufacturers, using a variety of high-specific-activity
labels (Table 1).

Against this background, let us now turn to the development of highly
sensitive, miniaturized "microspot" immunoassays and multianalyte
assay systems.

Microspot Immunoassay: Basic Concepts and Theory

Ambient Analyte Immunoassay 

Particular attention has been drawn above to the specious notion that
an antibody concentration approximating 0.5/K is required to maximize
the sensitivity of conventional labeled-antigen assays. This
proposition is implicitly overturned by the development of
"microspot" immunoassays, which we expect to provide the basis of a
new generation of binding assay methods. But before discussing this
methodology in detail, another basic analytical concept must be
examined.

The recognition that all immunoassays essentially rely on measurement
of antibody occupancy leads to a potentially important type of assay,
ambient analyte immunoassay (16). This name is intended to describe
assay systems that, unlike conventional methods, measure the analyte
concentration in the medium to which an antibody is exposed, being
independent both of sample volume and of the amount of antibody
present. The possibility of developing such assays follows from the
Law of Mass Action, which leads to the following equation
representing the fractional occupancy (F) by analyte of antibody
binding sites (at equilibrium):

$$F^2 - F\left\{\frac{1}{[\mathrm{Ab}]} + \frac{[\mathrm{An}]}{[\mathrm{Ab}]} + 1\right\} + \frac{[\mathrm{An}]}{[\mathrm{Ab}]} = 0 \tag{4}$$

where [An] = analyte concentration, [Ab] = antibody concentration
(both in units of 1/K).^1

From this equation it may readily be shown that, for antibody
concentrations approaching 0, F = [An]/(1 + [An]). This conclusion is
illustrated in Figure 7, in which the fractional occupancy of
("monospecific" or "monoclonal") antibody binding sites in the
presence of various analyte concentrations is plotted against
antibody concentration. When an antibody concentration of less than
(say) 0.01/K (the antibody preferably, but not essentially, being
coupled to a solid support) is exposed to an analyte-containing
medium, the resulting (fractional) occupancy of antibody binding
sites solely reflects the ambient concentration of analyte and is
independent of the total amount of antibody in the system. (If, for
example, K = 10^11 L/mol, an antibody binding-site concentration of
0.01/K represents 0.01 x 10^-11 mol/L, or 6.02 x 10^7 binding
sites/mL.) Analyte binding by antibody causes depletion of (unbound)
analyte in the medium but, because the amount bound is small, the
resulting reduction in the ambient concentration of analyte is
insignificant. For example, if the concentration of binding sites of
the sensor antibodies is <0.01/K, analyte depletion in the medium is
invariably <1%, and the system is therefore effectively independent
of sample volume.

Table 1. Detection Limits According to Type of Label

Label                    Specific activity
125I                     1 detectable event per second per
                         7.5 x 10^6 labeled molecules
Enzyme label             Determined by enzyme "amplification
                         factor" and detectability of
                         reaction product
Chemiluminescent label   1 detectable event per labeled molecule
Fluorescent label        Many detectable events per labeled
                         molecule

1 Expression of reagent concentrations in terms of 1/K units has
[...] assay data. The terms [Ab] and [An] are underlined to indicate
[...] in deriving equation 4 [...] with [Ab] and [An]. For example,
if the antibody possesses [...] (in units of 1/K) is 1
(dimensionless) unit. Thus, fractional occupancy curves based on
equation 4 are identical for all antibodies if this way of expressing
antibody concentration is used. [...] that antibody occupancy
reflects the analyte concentration to which antibody binding sites
are exposed, not the amount of analyte in the incubation tube, i.e.,
the system is independent of sample volume.

Fig. 7. Fractional antibody binding-site occupancy (F; see equation
4) plotted as a function of antibody binding-site concentration for
different values of analyte (antigen) concentration (solid lines),
and the percentage binding (b) of analyte to antibody (right-hand
ordinate; broken lines)
All concentrations are expressed in units of 1/K. Note that for
antibody concentrations <0.01/K (approximately), the percentage
binding of analyte is <1% for all analyte concentrations, and
fractional binding-site occupancy is essentially unaffected by
variations in antibody concentration extending over several orders of
magnitude, being governed solely by antigen concentration (ambient
analyte immunoassay). Note that radioimmunoassays and other
"competitive" immunoassays are conventionally designed to use
antibody concentrations approximating 0.5/K-1/K or more [implying
binding of analyte, at concentrations tending to zero (b0), of >30%],
in accordance with the precepts of Yalow and Berson (1, 3)
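
Equation 4 is a quadratic in F and can be solved directly. The following is a minimal sketch under stated assumptions: the function names are hypothetical, concentrations are dimensionless (units of 1/K) as in the text, and the physically meaningful solution is taken to be the smaller root (0 <= F <= 1).

```python
import math

AVOGADRO = 6.022e23  # molecules per mole

def fractional_occupancy(an, ab):
    """Smaller root of F^2 - F*(1/ab + an/ab + 1) + an/ab = 0 (equation 4).
    an, ab: total analyte and antibody binding-site concentrations in units of 1/K."""
    b = 1.0 / ab + an / ab + 1.0
    c = an / ab
    # Numerically stable form of (b - sqrt(b^2 - 4c)) / 2.
    return 2.0 * c / (b + math.sqrt(b * b - 4.0 * c))

def percent_analyte_bound(an, ab):
    """Percentage of the analyte that is antibody-bound at equilibrium."""
    return 100.0 * fractional_occupancy(an, ab) * ab / an

def binding_sites_per_ml(ab, K):
    """Convert a concentration in units of 1/K into binding sites per mL."""
    return ab / K * AVOGADRO / 1000.0

# Occupancy is nearly constant once the antibody concentration is below 0.01/K.
for ab in (1.0, 0.1, 0.01, 0.001, 0.0001):
    print(f"[Ab]={ab:<7} F={fractional_occupancy(1.0, ab):.4f} "
          f"bound={percent_analyte_bound(1.0, ab):.3f}%")

print(f"{binding_sites_per_ml(0.01, 1e11):.3e} binding sites/mL at 0.01/K, K = 1e11 L/mol")
```

As the antibody concentration falls, F tends to [An]/(1 + [An]) and the bound fraction of analyte stays below 1%, which is the ambient analyte condition.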

These conclusions lead to two further concepts. First, the antibody
may be confined to a "microspot" on a solid support, such that the
total number of antibody binding sites within the microspot is
<v/K x 10^-5 x N, where v = the sample volume to which the microspot
is exposed (in milliliters) and N = Avogadro's number (6 x 10^23).
For example, if v = 1 and K = 10^12 L/mol, then the maximum number of
binding sites that will cause negligible disturbance (<1%) to the
ambient concentration of analyte is 6 x 10^6, this number being
greater for lower-affinity antibodies. Furthermore, the perception
that the ratio of occupied (or unoccupied) sites to total binding
sites is solely dependent on the ambient concentration of analyte
leads to the concept of a dual-label, "ratiometric," microspot
immunoassay.

Dual-Label Microspot Immunoassay 

After exposure of a microspot of antibody (located on a suitable
probe) to an analyte-containing fluid (see Figure 8, left), the probe
may be removed and exposed to a solution containing a high
concentration of a "developing" antibody directed against either a
second epitope (i.e., the occupied site) on the analyte molecule if
the molecule is large, or against unoccupied binding sites on the
antibody in the case of small analyte molecules (Figure 8, right).
The fractional occupancy of the sensor antibody may thus be estimated
by measuring the ratio of sensor and developing antibodies that form
the dual-antibody "couplets." This can be readily achieved by
labeling the sensor and the developing antibodies with different
labels, e.g., a pair of radioactive, enzyme, or chemiluminescent
markers (or even labels of entirely different nature). Fluorescent
labels are potentially particularly useful in this context because,
by the use of optical scanning techniques (Figure 9), they permit the
scanning of arrays of antibody "microspots" distributed over a
surface (each microspot directed against a different analyte), so
that multiple analyte assays may be performed simultaneously on the
same sample. Several advantages stem from adopting a dual
fluorescence measurement. For example, neither the amount nor the
distribution of the sensor antibody within the detector's field of
view is important, because the ratio of the emitted fluorescent
signals is unaffected. Likewise, fluctuations in the intensity of the
incident (exciting) light beam are apt to be of little significance.
These advantages are additional to the basic benefit stemming from
this approach, i.e., that the necessity of ensuring constancy of the
amount of sensor antibody used in the assay system is removed.
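
The invariance of the signal ratio can be illustrated with a toy model. This is a sketch under simplifying assumptions that are not in the original: both signals are taken to be strictly proportional to the excitation intensity and to the number of labeled molecules in the field of view, background is ignored, and all names and numerical values are hypothetical.

```python
import math

def dual_label_signals(n_sensor, occupancy, intensity,
                       eff_sensor=1.0, eff_developing=1.0):
    """Return (sensor_signal, developing_signal) for one microspot.
    n_sensor: sensor antibody molecules in the field of view.
    occupancy: fraction of sensor sites carrying a developing antibody.
    intensity: excitation intensity (arbitrary units).
    eff_*: per-molecule detection efficiency of each label (assumed constants)."""
    sensor_signal = intensity * eff_sensor * n_sensor
    developing_signal = intensity * eff_developing * n_sensor * occupancy
    return sensor_signal, developing_signal

def signal_ratio(n_sensor, occupancy, intensity, **eff):
    sensor, developing = dual_label_signals(n_sensor, occupancy, intensity, **eff)
    return developing / sensor

# Same occupancy, very different amounts of antibody and beam intensities.
r1 = signal_ratio(n_sensor=1e6, occupancy=0.02, intensity=1.0)
r2 = signal_ratio(n_sensor=3e4, occupancy=0.02, intensity=0.4)
print(r1, r2)
```

In this idealized model the ratio depends only on the occupancy and on the relative detection efficiencies of the two labels, so neither the amount of sensor antibody nor the beam intensity needs to be held constant.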

Microspot Immunoassay Sensitivity 

Because the microspot immunoassay methodology challenges concepts
that have dominated immunoassay design theory in the past two to
three decades, consideration of the potential sensitivity attainable
by this approach is obviously of primary importance. The proposition
that microspot assays may be at least as sensitive as conventional
systems that rely on far larger amounts of antibody may readily be
demonstrated by consideration of a model system. Let us postulate
that sensor antibody molecules are attached to the surface of a solid
support such that their binding sites remain exposed to the analyte,
and that their affinity for the analyte is thereby unchanged. (The
antibody concentration in the system, i.e., the number of binding
sites on the support divided by the incubation volume, is unaffected
by such attachment, and antibody occupancy by analyte at equilibrium
will be identical to that occurring if the antibody is distributed
uniformly throughout the incubation mixture.) Let us also suppose
that the antibody molecules exist as a uniform monolayer of maximal
surface density on the support and (to simplify discussion) are
unlabeled. Then a change in the concentration of sensor antibody
implies a corresponding change in the surface area over which the
antibody is distributed. If, for example, the antibody affinity
constant is 10^11 L/mol, the total incubation volume is 1 mL, and the
antibody surface density is 6000 binding sites/μm^2, then a surface
area of 1 mm^2 (10^6 μm^2) accommodates antibody binding sites
corresponding to a concentration of 1/K, an area of 0.01 mm^2
corresponds to a concentration of 0.01/K, etc. Let us further
postulate that, after exposure of the sensor antibodies to a medium
containing analyte at a concentration of 0.01/K (i.e., 6 x 10^7
molecules/mL), we measure "noncompetitively" the resulting antibody
occupancy (e.g., by exposure to a second labeled, "developing"
antibody directed against the analyte, forming a typical antibody
sandwich). Finally, let us suppose that all occupied sites react with
the developing antibody, with the latter also binding
"nonspecifically" to the solid support itself at a surface density of
1 molecule/μm^2.

We may now consider the effects of a progressive reduction of the
antibody-coated surface area from (e.g.) 1 mm^2 (antibody
concentration 1/K) through 0.1 mm^2 (0.1/K) to 0.01 mm^2 (0.01/K) and
below. From equation 4, the value of F for the 1 mm^2 area is
4.98 x 10^-3. Thus at equilibrium the number of analyte and labeled
antibody molecules specifically bound to the area is 2.99 x 10^7
(i.e., about 50% of the total analyte molecules present), whereas the
number of labeled antibody molecules nonspecifically bound is 10^6.
Thus, assuming the field of view of the detecting instrument is
restricted to the area on which the sensor antibody is deposited (see
Figure 10a), and (provisionally) assuming the background (or "noise")
of the instrument itself to be zero (i.e., the only source of
background is the nonspecifically bound labeled antibody within the
instrument's field of view), the signal/noise ratio observed for the
1 mm^2 area is ~30. Similarly, the value of F for a 0.1 mm^2 area is
9.02 x 10^-3, the number of labeled antibody molecules specifically
bound to the area is 5.41 x 10^6, the number nonspecifically bound is
10^5, and the signal/noise ratio is ~54. Likewise, the signal/noise
ratio for a 0.01 mm^2 area can be shown to be ~59. In short, the
signal/noise ratio increases as the antibody-coated surface area is
decreased, approaching a maximal (plateau) value of 60 as the area
coated with sensor antibody falls below 0.01 mm^2 and tends toward
zero.
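
The figures quoted for the model system can be reproduced from equation 4. The sketch below assumes the model parameters as read from the text (6000 binding sites/μm^2, so that 1 mm^2 corresponds to an antibody concentration of 1/K in 1 mL when K = 10^11 L/mol; 1 nonspecifically bound molecule/μm^2; analyte at 0.01/K); the function and constant names are hypothetical.

```python
import math

SITES_PER_UM2 = 6000.0       # sensor antibody binding sites per square micrometre
NONSPECIFIC_PER_UM2 = 1.0    # nonspecifically bound developing antibody per square micrometre
ANALYTE = 0.01               # analyte concentration in units of 1/K

def fractional_occupancy(an, ab):
    """Smaller root of equation 4 (concentrations in units of 1/K)."""
    b = 1.0 / ab + an / ab + 1.0
    c = an / ab
    return 2.0 * c / (b + math.sqrt(b * b - 4.0 * c))

def microspot_signal_to_noise(area_mm2):
    """Return (F, specifically bound, nonspecifically bound, signal/noise)
    for a coated area in mm^2, with the detector viewing only that area."""
    area_um2 = area_mm2 * 1e6
    ab = area_mm2  # in this model 1 mm^2 of coating corresponds to [Ab] = 1/K
    f = fractional_occupancy(ANALYTE, ab)
    specific = f * SITES_PER_UM2 * area_um2
    nonspecific = NONSPECIFIC_PER_UM2 * area_um2
    return f, specific, nonspecific, specific / nonspecific

for area in (1.0, 0.1, 0.01, 0.001):
    f, s, n, sn = microspot_signal_to_noise(area)
    print(f"{area:>6} mm^2  F={f:.3e}  specific={s:.3e}  nonspecific={n:.1e}  S/N={sn:.1f}")
```

The ratio rises from about 30 at 1 mm^2 toward a plateau just under 60, because the signal per unit area is proportional to F while the nonspecific background per unit area is constant.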

If, however, a reduction in the antibody-coated area 
were not accompanied by a corresponding reduction in 
the detecting instrument's field of view, the resulting 
reduction in "signal" would not lead to a corresponding 
decrease in the background generated by nonspecifically bound
developing antibody (Figure 10b). Therefore, although reduction in
the coated area would in-
crease the fractional occupancy of the sensor antibody, 
the signal/noise ratio might either remain constant or 
fall. In these circumstances it might be advantageous to 
increase the coated area. Similarly, if the surface den- 
sity of sensor antibody were decreased (the coated area 
being held constant), similar conclusions would be 
reached (Figure 10c). 

Likewise, if the background signal generated within the detecting
instrument itself (e.g., from the photocathode of a photomultiplier
tube used to detect photons emitted from the antibody-coated area)
were not zero, and remained constant regardless of the instrument's
field of view, then a maximum signal/noise ratio would also be
attained at some optimal value of the antibody-coated area, below
which the ratio would fall. Because, however, one can generally
reduce the size of the detector (and hence the detector-generated
background) at the same rate as the size of the signal-emitting area,
there is no reason, in principle, for the signal/noise ratio to
diminish as the antibody-coated area is progressively reduced toward
zero. Thus if we accept the signal/noise ratio as indicative of the
precision of the measurement of antibody occupancy (and hence of
assay sensitivity), these considerations suggest that it is
advantageous to reduce the antibody-coated surface area (and,
concomitantly, the sensor-antibody concentration) toward zero,
although little advantage is likely to accrue from reducing the area
below 0.01 mm^2 (and thus the antibody concentration below 0.01/K).

Were the microspot area indeed reduced to zero, both signal and noise
would likewise also fall to zero (the ratio between them nevertheless
remaining essentially constant), implying that no signal of any kind
would, in the limit, be recorded. In practice, other statistical
factors come into play when the number of individual events (e.g.,
photons) observed by a detecting instrument is very low, thus
prohibiting a reduction of the sensor antibody concentration to zero.
The point at which the reduction in the antibody-coated area causes
the detectable signal to be lost sufficiently to affect the precision
of the measurement of antibody occupancy depends clearly on the
specific activity of the labeled antibody used to measure the
occupied binding sites: the higher the specific activity, the smaller
the permissible area. Thus, given labels of very high specific
activity, one can envision circumstances in which, even in a
"noncompetitive" system, the optimal concentration of sensor antibody
may be exceedingly low. A more general conclusion is that a variety
of factors, including the characteristics of the instruments used for
measuring the labeled antibody (or labeled analyte), influence
immunoassay design, implying, among other things, the virtual
impossibility of formulating general rules regarding this. For
example, reagent concentrations that are optimal for isotopically
labeled reagents used with a conventional radioisotope counter
(possessing a fixed background dependent on its basic construction)
are likely to be entirely different when very high-specific-activity
labels are used and one has the freedom to tailor the measuring
instrument to samples of any size. In short, certain conclusions
based on experience of RIA and IRMA techniques may prove misleading
when applied to nonisotopic methodologies, and should be viewed with
caution.

A more detailed theoretical consideration of (noncompetitive)
microspot immunoassay sensitivity (21) suggests that

$$C_{\min} = \frac{D^{*}_{\min} \times (6 \times 10^{20})\,(1 + [\mathrm{Ab}^{*}])}{D\,K\,[\mathrm{Ab}^{*}]} \tag{5}$$

where D = surface density (binding sites/μm^2) of sensor antibody,
K = sensor antibody affinity (L/mol), [Ab*] = concentration of
labeled antibody in developing solution (expressed in units of 1/K*,
where K* = labeled antibody affinity), D*min = minimum detectable
surface density of labeled antibody (molecules/μm^2), and Cmin =
assay detection limit (molecules/mL). For example, if [Ab*] = 1,
D = 10^5 molecules/μm^2, K = 10^11 L/mol, and D*min = 20
molecules/μm^2, then Cmin = 2.4 x 10^6 molecules/mL = 4 x 10^-15
mol/L and the fractional occupancy of the binding sites of the sensor
antibody by the minimum detectable concentration of analyte is 0.04%.
Figure 11 shows the theoretical assay sensitivities attainable with
use of sensor antibodies of various affinities, plotted as a function
of D*min.
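
Equation 5 and its worked example can be evaluated as follows. This is a sketch under stated assumptions: the scan is unclear at several exponents, so D = 10^5 molecules/μm^2, K = 10^11 L/mol and the factor 6 x 10^20 (molecules/mL per mol/L) are taken as the values consistent with the quoted 0.04% occupancy; the function names are hypothetical, and the occupancy expression is derived here for low occupancy (F ≈ K x concentration) rather than quoted from the source.

```python
import math

MOLECULES_PER_ML_PER_MOLAR = 6e20  # Avogadro's number (6e23) divided by 1000 mL/L

def detection_limit_molecules_per_ml(d_star_min, d, K, ab_star):
    """Equation 5: C_min in molecules/mL.
    d_star_min: minimum detectable labeled-antibody density (molecules/um^2)
    d:          sensor antibody surface density (binding sites/um^2)
    K:          sensor antibody affinity (L/mol)
    ab_star:    labeled antibody concentration in units of 1/K*"""
    return d_star_min * MOLECULES_PER_ML_PER_MOLAR * (1.0 + ab_star) / (d * K * ab_star)

def occupancy_at_detection_limit(d_star_min, d, ab_star):
    """Fractional occupancy of sensor sites at C_min (low-occupancy approximation)."""
    return d_star_min * (1.0 + ab_star) / (d * ab_star)

c_min = detection_limit_molecules_per_ml(d_star_min=20, d=1e5, K=1e11, ab_star=1.0)
c_min_molar = c_min / MOLECULES_PER_ML_PER_MOLAR
occupancy = occupancy_at_detection_limit(d_star_min=20, d=1e5, ab_star=1.0)

print(f"C_min = {c_min:.2e} molecules/mL = {c_min_molar:.1e} mol/L")
print(f"occupancy at C_min = {100 * occupancy:.2f}%")
```

The detection limit scales linearly with D*min and inversely with both D and K, which is why close packing of sensor antibody and a low-background detector matter as much as affinity.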

A similar theoretical analysis of competitive microspot immunoassay
indicates that potential sensitivities are essentially identical to
those attainable with conventional competitive methodologies. In
summary, the above considerations indicate that the attainment of
high microspot assay sensitivity requires close packing of molecules
of sensor antibodies within the microspot area, combined with the use
of an instrument capable of accurately measuring very low surface
densities of developing antibodies. They also suggest that (a)
microspot assay sensitivities considerably higher than those
obtainable by conventional isotopically based immunoassays are
achievable, and (b) if labels of very high specific activity are
available, the sensitivities yielded by microspot assays are unlikely
to be inferior and (depending on the characteristics of the measuring
instrument used) could be superior to the sensitivities achievable in
macroscopic assays of conventional design.

Finally, we briefly address a further question occasionally raised in
this context, i.e., the kinetic characteristics of microspot assays.
Two points should be made regarding this issue. First, the smaller
the microspot of sensing antibody, the lower the diffusion
constraints on the antibody/analyte binding reaction, so that at the
limit (i.e., when the amount of antibody within the microspot
approaches zero) the kinetics of the reaction approximate those
observed in a homogeneous liquid-phase system. Second, although the
effective concentration of sensor antibody in the incubation medium
is exceedingly low, the fractional rate at which sensor antibody
binding sites within the microspot become occupied is invariably
greater in this circumstance than when a relatively high
concentration of antibody is used, as in conventional assays,
particularly those of noncompetitive design. In other words, bearing
in mind the relationship between fractional occupancy of sensor
antibody and the signal/noise ratio discussed above, it is readily
demonstrable that the rate at which the signal/noise ratio increases
is greatest when the area (and the antibody contained within it) is
least. Thus, given instrumentation whose field of view is restricted
to the microspot area, the highest signal/noise ratio will be
observed after any selected incubation period when the amount of
sensor antibody in the system is least. Contrary to a perhaps
superficial impression, and to the generally accepted belief that
short immunoassay incubation times require the use of very large
amounts of antibody, the antibody microspot approach provides the
basis of assays potentially more rapid than any currently available.
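
The second kinetic point can be illustrated with a simple mass-action model. This sketch rests on assumptions that are not in the original: a well-mixed system with no diffusion limitation, a single on-rate and off-rate, concentrations in units of 1/K and time in units of 1/k_off, and plain Euler integration; the function name and the chosen values are hypothetical.

```python
def occupancy_after(an, ab, t_end, dt=1e-4):
    """Fractional occupancy F(t_end) from dF/dt = (an - F*ab)*(1 - F) - F.
    an, ab: total analyte and binding-site concentrations (units of 1/K).
    t_end, dt: incubation time and step size (units of 1/k_off)."""
    f = 0.0
    for _ in range(int(round(t_end / dt))):
        free_analyte = an - f * ab          # analyte not yet bound
        f += dt * (free_analyte * (1.0 - f) - f)
    return f

AN = 0.01  # analyte at 0.01/K, as in the model system above
low_antibody = occupancy_after(AN, ab=0.01, t_end=1.0)   # microspot-like amount
high_antibody = occupancy_after(AN, ab=10.0, t_end=1.0)  # conventional excess

print(f"F after t=1, [Ab]=0.01/K: {low_antibody:.5f}")
print(f"F after t=1, [Ab]=10/K:   {high_antibody:.5f}")
```

With little antibody the analyte is not depleted, so each site continues to see the full ambient concentration and fills faster; with a large excess of antibody the free analyte falls quickly and the occupancy of each site stays low, even though more analyte in total is captured.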

Microspot Immunoassay: Some Practical Considerations 

Although various high-specific-activity antibody labels are
potentially usable in this context, [...] fluorophors. The
simultaneous measurement of dual fluorescences from small areas is
[...] well established [...] (e.g., the laser scanning [...]) not
specifically designed for the present purpose, but [...] beam, the
fluorescence photons emitted from the area being focused in turn onto
a detector, typically a photomultiplier (22, 23). At the [...] focal
point the projection of the illumination pinhole [...] (Figure 12).
Fluorescence photons emitted at other points thus possess a low
probability of reaching the detector. Such systems contrast with
conventional fluorescence microscopes, in which the specimen is
exposed to an essentially uniform flux of illumination, and yield
much sharper images of fluorescent structures situated in a defined
plane of a tissue sample. Electrons spontaneously emitted by the
photomultiplier photocathode contribute to the background signal of
the instrument, and must (for highest microspot assay sensitivity) be
minimized. Fortunately, the design of such instruments permits the
photocathode to be very small
in area, and this source of background can be expected to diminish
with future improvement in photomultiplier design. Other sources of
background include fluorescence emitted by components in the optical
system, which may not, in current instruments, have been constructed
with background reduction as a prime consideration. Nevertheless,
they detect fluorescent signals with high sensitivity. For example,
one commercially available microscope is claimed to detect
fluorescein at a density of 10 molecules/μm^2. Most commercially
available fluorescein isothiocyanate (FITC)-labeled IgG exhibits a
fluorophor/protein ratio of ~4; this implies a detection limit
(D*min) for antibody surface density of two or three FITC-labeled IgG
molecules per μm^2. This, in turn, implies a theoretical sensitivity
for a two-site immunoassay of ~2-3 x 10^5 analyte molecules per
milliliter, assuming identical parameter values as above, or
2-3 x 10^4 molecules/mL if the sensing antibody has an affinity of
10^12 L/mol. Clearly, sensitivity may be increased by loading more
fluorophor either directly or indirectly onto the antibody.

Our preliminary studies have relied on a less sensitive microscope,
albeit one possessing facilities for dual-fluorescence measurement.
Its argon laser emits two excitation lines at 488 and 514 nm. It is
thus particularly efficient in exciting blue/green-emitting
fluorophores such as FITC (excitation maximum 492 nm), but is less
efficient in exciting fluorophores such as Texas Red (excitation
maximum 596 nm). However, the ratiometric assay principle permits
considerable variation in detection efficiencies of the two labels
because the specific activities of the labeled antibody species
forming the antibody couplets can be chosen to yield signal ratios
approximating unity. Inefficiency of the argon laser in exciting
Texas Red is thus not a major handicap in this context. Though this
instrument relies on a conventional microscope and not on an optical
system designed for this purpose (and is thus implicitly less
sensitive), it permits quantification of fluorescence signals
generated from microspots of any selected area. Initial studies have
revealed that, under conditions that are not optimal, the instrument
is capable of detecting ~25 FITC-labeled and (or) 150 Texas
Red-labeled IgG molecules per μm^2, while scanning an area of
~50 μm^2.

The development of microspot immunoassays has also necessitated
closer scrutiny of the mechanisms involved in the coupling of
antibodies to solid supports. In the present context, these should
display a capacity to adsorb (in the form of a monolayer), or to
covalently link, a high surface density of antibody combined with low
intrinsic-signal-generating properties (e.g., low intrinsic
fluorescence), thus minimizing background. We have examined a number
of candidate materials, such as polypropylene, Teflon, cellulose and
nitrocellulose membranes, microtiter plates (clear polystyrene
plates; black, white, and clear polystyrene plates), glass slides and
quartz optical fibers coated with 3-(aminopropyl)triethoxysilane,
etc., and several alternative protocols for achieving high monolayer
coating densities. These studies have exposed phenomena neither
evident nor of importance when antibody binding to solid supports is
examined at a macroscopic level. Provisionally, we have used white
Dynatech Microfluor microtiter plates (formulated for the detection
of low fluorescence signals, and yielding high signal/noise ratios
and high coating densities of functional antibodies, ~5 x 10^4 IgG
molecules/μm^2) for assay development, although such plates are not
ideal. Indeed, deficiencies in the antibody-deposition methods used
constitute the principal source of imprecision in assay results and
the limitation in sensitivity that this implies. Clearly, this
represents an area for further study and refinement of current
coating techniques.

Notwithstanding the limitations of present instru- 
mentation (which, among other things, does not permit 
the use of time-resolving techniques to distinguish two 
individual fluorescence signals either from each other or 
from background fluorescence) and the crudeness of
present methods for coupling antibodies onto small 
areas, we have verified the theoretical concepts outlined 
above by comparing the performance of several assays 
when constructed in microspot format and when conven- 
tionally designed. Although unoptimized, ratiometric 
microspot assays have yielded sensitivity values closely 
approaching those of conventional optimized IRMA. As 
an example, the results of a ratiometric assay system for 
thyrotropin, with use of Texas Red- and FITC-labeled 
antibodies, are shown in Figure 13. Bearing in mind the 
well-known limitations of these and other "convention- 
al" fluorophors when used as immunoassay reagent 
labels, such results are encouraging, although further 
work is clearly required to achieve the considerably 
greater sensitivity theoretically predicted with use of 
improved fluorophors, better antibody-microspotting
techniques, and purpose-built (time-resolving) instru- 
mentation. 

The finding that highly sensitive immunoassays can 
be performed with far smaller amounts of antibody than 
are currently used conventionally permits in turn the 
construction of antibody microspot arrays enabling, in 
principle, the simultaneous measurement of thousands 
of different substances in 1-mL samples. In collabora- 
tion with investigators at the Centre for Applied Micro- 
biological Research, Porton Down, U.K., we are pres- 
ently developing various techniques for the creation of 
such arrays. Indeed, similar technologies have recently 
been used for the parallel synthesis of several different 
polypeptides, these enabling 10 000-microspot arrays to 
be constructed on silica chips approximating 1 cm^2 (24). 
Although arrays of this capacity are unlikely ever to be 
required for conventional diagnostic purposes, we can 
anticipate that the ability to simultaneously measure 
many substances in the same sample will have revolu- 
tionary consequences in medicine and other similar 
areas. In addition, such techniques may ultimately 
permit the individual analysis of the multiple isoforms 
of certain "heterogeneous" analytes (e.g., the glycopro- 
tein hormones), such molecular heterogeneity currently 
presenting a major obstacle to the standardization and 
interpretation of many immunological measurements 
(25). Moreover, although these concepts have been illus- 
trated in an immunoassay context, they are clearly 
applicable to all "binding assays," including those rely- 
ing on the use of DNA probes, hormone receptors, etc. 
For example, labeled lectins that are specific in their 
reactions with the sugar residues in the oligosaccharide 
chains of glycoprotein molecules may be used, together 
with specific antibodies, to impart additional "structural 
specificity" to sandwich assays (26, 27), possibly over- 
coming the limitations of antibodies per se in regard to 
differentiation of the glycosylation variants of the gly- 
coprotein hormones. 

Summary and Conclusion 

Because of past confusion regarding the concepts of 
precision, sensitivity, accuracy, etc., several erroneous 
concepts have become incorporated within currently 
accepted rules of immunoassay design. In particular, 
much higher antibody concentrations are customarily 
used than are necessary to achieve very high assay 
sensitivity, provided that certain measurement strate- 
gies are adhered to. In this presentation, we have 
attempted to show that, in principle, the highest assay 
sensitivities are obtained by confining a small number 
of sensor antibody molecules onto a very small area in 
the form of a microspot and measuring their occupancy 
by an analyte, by using very high-specific-activity "de- 
veloping" antibody probes, thereby maximizing the sig- 
nal/noise ratio in the determination of sensor antibody 
occupancy. This observation, which contradicts cur- 
rently accepted immunoassay design theory, in turn 
makes possible the measurement of an unlimited num- 
ber of different analytes on a chip of very small surface 
area through the use of, e.g., laser scanning techniques 
closely analogous to those used in compact disk tech- 
niques of sound recording. Extensive experimental stud- 
ies in this area, albeit conducted with relatively crude 
techniques and instrumentation not specifically de- 
signed for these purposes, and therefore not reported in 
detail here, have demonstrated the feasibility of the 
miniaturized antibody microspot approach and the va- 
lidity of the general concepts on which it is based. We 
are therefore confident that this represents the basis of 
a next-generation technology that is likely to have a 
revolutionary impact on all fields involving the use of 
binding assays. 
Corrections 



Vol 37, pp. 1447-8: In our desire for rapid publication, 
important errors were introduced into the following 
Technical Brief. The corrected version is here repro- 
duced in its entirety, with our apologies to the authors. 

Rapid Detection of 1717-1G→A Mutation in CFTR Gene 
by PCR-Mediated Site-Directed Mutagenesis, Laura 
Cremonesi,1 Manuela Seia,2 Carmelina Magnani,1 and 
Maurizio Ferrari1 (1 Istituto Scientifico H.S. Raffaele, 
Lab. Centrale, Milano; 2 Istituti Clin. di Perfezionamento, 
Lab. di Ricerche Clin., Milano, Italy) 

Until now, among the non-ΔF508 mutations identified in 
the cystic fibrosis transmembrane conductance regulator 
(CFTR) gene by the Cystic Fibrosis (CF) Genetic Analysis 
Consortium, the ones most frequently seen in our popula- 
tion sample are the 1717-1G→A mutation (13/144 or 9% of 
the CF chromosomes) and the G542X mutation (16/190 or 
8.4% of the CF chromosomes), both revealed by dot-blot 
hybridization of the polymerase chain reaction (PCR) prod- 
uct with allele-specific oligonucleotide (ASO) probes (1). 

In an attempt to simplify the analysis of the most 
frequent mutations in the CFTR gene, we converted radio- 
labeled ASO detection into restriction endonuclease anal- 
ysis of the amplified product. 

A PCR-mediated site-directed mutagenesis (2, 3) to de- 
tect the G542X mutation by generating a novel BstNI site 
in the wild-type sequence had already been suggested (4). 

To detect the 1717-1G→A mutation, we designed the 
reverse primer (5'-CTCTGCAAACTTGGAGA^TC-3') to 
contain a single-base mismatch (T→G), which could create 
a novel AvaII restriction site [G↓G(A/T)CC] in the am- 
plified wild-type (WT) allele but not in the CF mutant (M) 
allele: 
Fig. 1. Detection of the 1717-1G→A mutation by PCR 

Reactions were carried out with 1 µg of genomic DNA in a total volume of 100 
µL containing 10 mmol/L Tris·HCl (pH 8.3), 60 mmol/L KCl, 1.6 mmol/L 
MgCl2, 0.1 g/L gelatin, 200 µmol/L each of the four deoxyribonucleotide 
triphosphates, 2.6 units of Taq polymerase (Perkin-Elmer Cetus, Norwalk, 
CT), and 100 pmol of each of the primers. PCR conditions were as follows: 
denaturation at 94 °C for 1 min, annealing at 55 °C for 30 s, and extension at 
72 °C for 1 min, for a total of 30 cycles. PCR products were digested for 2 h at 
37 °C with 5 U of AvaII and electrophoresed on 3% agarose-1% NuSieve gel 
for 1 h at 50 V. Bands were made visible by staining the gel with ethidium 
bromide. Lane 1: HaeIII-digested pBR322 size marker. Lane 2: normal 
homozygote. Lane 3: CF patient homozygous for the 1717-1G→A mutation. 
Lane 4: heterozygote carrier for the 1717-1G→A mutation. 



For the forward primer, we used the one made available 
by the CF Genetic Analysis Consortium to amplify exon 11 
of the CFTR gene: 5'-CAACTGTGGTTAAAGCAATAGTGT-3'. 

Digestion of the PCR product by the AvaII enzyme generates 
two fragments of 116 and 21 bp in the wild-type alleles 
and leaves undigested a 137-bp fragment in the mutant 
alleles (Figure 1). 
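
The fragment pattern can be reproduced with a few lines of code. The following is only a minimal sketch: the two 137-nt sequences are hypothetical placeholders built to carry (or lack) one AvaII site at the right position, not the real CFTR amplicon, and the function name `avaii_fragments` is an illustrative assumption. Only the recognition pattern G↓G(A/T)CC and the fragment sizes come from the text.

```python
import re

# AvaII recognizes GG(A/T)CC and cuts after the first G (G^GWCC).
AVAII_SITE = re.compile(r"GG[AT]CC")

def avaii_fragments(seq):
    """Return single-strand fragment lengths after cutting at every AvaII site."""
    cuts = [m.start() + 1 for m in AVAII_SITE.finditer(seq)]
    edges = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(edges, edges[1:])]

# Hypothetical 137-nt placeholders (NOT the CFTR sequence):
# the wild-type toy carries one GGTCC site, the mutant toy has it disrupted.
wild_type = "A" * 115 + "GGTCC" + "A" * 17
mutant = "A" * 115 + "GGTCT" + "A" * 17

print(avaii_fragments(wild_type))  # two fragments
print(avaii_fragments(mutant))     # one undigested fragment
```

A heterozygote carrier would show the union of both patterns on the gel, that is, the 137-, 116-, and 21-bp bands together.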

By combined analysis for the ΔF508 mutation (5) (252/ 
470 or 53.6% of the CF chromosomes), 1717-1G→A, and 
G542X, about 71% of mutations might be detected by 
nonisotopic analysis of the PCR product, thus allowing a 
faster and easier one-day procedure for carrier screening 
and prenatal testing. 
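
The "about 71%" figure is the sum of the three chromosome frequencies quoted above. A short check, assuming (as the text implies) that the three mutations are counted on separate chromosomes so their frequencies simply add; the dictionary keys are labels chosen here for illustration:

```python
from fractions import Fraction

# Mutant chromosomes / chromosomes screened, as quoted in the text.
frequencies = {
    "deltaF508": Fraction(252, 470),
    "1717-1G>A": Fraction(13, 144),
    "G542X": Fraction(16, 190),
}

# Assumption: the mutations are mutually exclusive per chromosome,
# so the combined detection rate is the plain sum.
combined_percent = float(sum(frequencies.values())) * 100

for name, freq in frequencies.items():
    print(f"{name}: {float(freq) * 100:.1f}%")
print(f"combined: {combined_percent:.1f}%")
```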
differentially expressed genes in healthy and diseased subjects 

Cross Reference to Related Applications: 
This application is a continuation-in-part application of U.S. Serial No. 
08/195,485 filed February 14, 1994, the contents of which are incorporated herein by 
reference. 

Field of the Invention 

The present invention relates to the use of immobilized 
oligonucleotide/polynucleotide or polynucleotide sequences for the identification, 
sequencing and characterization of genes which are implicated in disease, infection, 
or development and the use of such identified genes and the proteins encoded thereby 
in diagnosis, prognosis, therapy and drug discovery. 

Background of the Invention 

Identification, sequencing and characterization of genes, especially 
human genes, is a major goal of modern scientific research. By identifying genes, 
determining their sequences and characterizing their biological function, it is possible 
to employ recombinant DNA technology to produce large quantities of valuable "gene 
products", e.g., proteins and peptides. Additionally, knowledge of gene sequences 
can provide a key to diagnosis, prognosis and treatment of a variety of disease states 
in plants and animals which are characterized by inappropriate expression and/or 
repression of selected gene(s) or by the influence of external factors, e.g., carcinogens 
or teratogens, on gene function. The term disease-associated gene(s) is used herein 
in its broadest sense to mean not only genes associated with classical inherited 
diseases, but also those associated with genetic predisposition to disease as well as 
infectious or pathogenic states resulting from gene expression by infectious agents or 
the effect on host cell gene expression by the presence of such a pathogen or its 
products. Locating disease-associated genes will permit the development of 
diagnostic and prognostic reagents and methods, as well as possible therapeutic 
regimens, and the discovery of new drugs for treating or preventing the occurrence of 
such diseases. 

Methods have been described for the identification of certain novel 
gene sequences, referred to as Expressed Sequence Tags (EST) [see, e.g., Adams et 
al, Science, 252:1651-1656 (1991); and International Patent Application No. 
WO93/00353, published January 7, 1993]. Conventionally, an EST is a specific cDNA 
polynucleotide sequence, or tag, about 150 to 400 nucleotides in length, derived from 
a messenger RNA molecule by reverse transcription, which is a marker for, and 
component of, a human gene actually transcribed in vivo. However, as used herein an 
EST also refers to a genomic DNA fragment derived from an organism, such as a 
microorganism, the DNA of which lacks intron regions. 

A variety of techniques have been described for identifying particular 
gene sequences on the basis of their gene products. For example, several techniques 
are described in the art [see, e.g., International Patent Application No. WO91/07087, 
published May 30, 1991]. Additionally, known methods exist for the amplification of 
desired sequences [see, e.g., International Patent Application No. WO91/17271, 
published November 14, 1991, among others]. 

However, at present, there exist no established methods for filling the 
need in the art for methods and reagents which employ fragments of differentially 
expressed genes of known, unknown (or previously unrecognized) function or 
consequence to provide diagnostic and therapeutic methods and reagents for diagnosis 
and treatment of disease or infection, which conditions are characterized by such 
genes and gene products. It should be appreciated that it is the expression differences 
that are diagnostic of the altered state (e.g., predisease, disease, pathogenic, 
progression or infectious). Such genes associated with the altered state are likely to 
be the targets of drug discovery, whether the genes are the cause or the effect of the 
condition. Identification of such genes provides insight into which gene expression 
needs to be re-altered in order to reestablish the healthy state. 

Summary of the Invention 

In one aspect, the invention provides methods for identifying gene(s) 
which are differentially expressed, for example, in a normal healthy organism and an 
organism having a disease. The method involves producing and comparing 
hybridization patterns formed between samples of expressed mRNA or cDNA 
polynucleotide sequences obtained from either analogous cells, tissues or organs of a 
healthy organism and a diseased organism and a defined set of 
oligonucleotide/polynucleotide sequence probes from either a 
healthy organism or a diseased organism immobilized on a support. Those defined 
oligonucleotide/polynucleotide sequences are representative of the total expressed 
genetic component of the cells, tissues, organs or organism as defined by the collection 
of partial cDNA sequences (ESTs). The differences between the hybridization 
patterns permit identification of those particular EST or gene-specific 
oligonucleotide/polynucleotide sequences associated with differential expression, and 
the identification of the EST permits identification of the clone from which it was 
derived and, using ordinary skill, further cloning and, if desired, sequencing of the full- 
length cDNA and genomic counterpart, i.e., gene, from which it was obtained. 

In another aspect, the invention provides methods substantially similar 
to those described above, but which permit identification of those gene(s) of a 
pathogen which are expressed in any biological sample of an infected organism based 
on comparative hybridization of RNA/cDNA samples derived from a healthy versus 
infected organism, hybridized to an oligonucleotide/polynucleotide set representative 
of the gene coding complement of the pathogen of interest. 

In another aspect, the invention provides methods substantially similar 
to those described above, but which permit identification of those EST-specific 
oligonucleotide/polynucleotide sequences of host gene(s) which represent genes being 
differentially expressed/altered in expression by the disease state or infection and are 
expressed in any biological sample of an infected organism based on comparative 
hybridization of RNA/cDNA samples derived from a healthy versus infected 
organism of interest. 

In a further aspect, the methods described above, and in detail below, 
also provide methods for diagnosis of diseases or infections characterized by 
differentially expressed genes, the expression of which has been altered as a result of 
infection by the pathogen or disease causing agent in question. All identified 
differences provide the basis for diagnostic testing, be it the altered expression of 
endogenous genes or the patterned expression of the genes of the infecting organism. 
Such patterns of altered expression are defined by comparing RNA/cDNA from the 
two states hybridized against a panel of oligonucleotide/polynucleotides representing 
the expressed gene component of a cell, tissue, organ or organism as defined by its 
collection of ESTs. 

Yet a further aspect of this invention provides a composition suitable 
for use in hybridization, which comprises a solid surface on which is immobilized at 
pre-defined regions thereon a plurality of defined oligonucleotide/polynucleotide 
sequences for hybridization, each sequence comprising a fragment of an EST isolated 
from a cDNA or DNA library prepared from at least one selected tissue or cell 
sample of a healthy (i.e., pre-disease state) animal, at least one analogous sample of 
an animal having a disease, at least one analogous sample of an animal infected with a 
pathogen or the pathogen itself, or any combination or multiple combinations thereof. 

An additional aspect of the invention provides an isolated gene 
sequence which is differentially expressed in a normal healthy animal and an animal 
having a disease, and is identified by the methods above. Similarly, an isolated 
pathogen gene sequence which is expressed in tissue or cell samples of an infected 
animal can be identified by the methods above. 

Yet another aspect of the invention is that it provides not only a means 
for a static diagnostic but also a means for carrying out the procedure over 
time to measure disease progression as well as monitoring the efficacy of disease 
treatment regimes, including any toxicological effects thereof. 

Another aspect of the invention is an isolated protein produced by 
expression of the gene sequences identified above. Such proteins are useful in 
therapeutic compositions or diagnostic compositions, or as targets for drug 
development. 

Other aspects and advantages of the present invention are described 
further in the following detailed description of the preferred embodiments thereof. 

Detailed Description of the Invention 

The present invention meets the unfulfilled needs in the art by 
providing methods for the identification and use of gene fragments and genes, even 
those of unknown full length sequence and unknown function, which are 
differentially expressed in a healthy animal and in an animal having a specific disease 
or infection by use of ESTs derived from DNA libraries of healthy and/or 
diseased/infected animals. Employing the methods of this invention permits the 
identification and isolation of such genes by using their corresponding ESTs 
and thereby also permits the production of protein products encoded by such genes. 
The genes themselves and/or protein products, if desired, may be employed in the 
diagnosis or therapy of the disease or infection with which the genes are associated 
and in the development of new drugs therefor. 

It has been appreciated that one or more differentially identified EST 
or gene-specific oligonucleotide/polynucleotides define a pattern of differentially 
expressed genes diagnostic of a predisease, disease or infective state. A knowledge of 
the specific biological function of the EST is not required, only that the EST 
identifies a gene or genes whose altered expression is associated reproducibly with 
the predisease, disease or infectious state. The differences permit the identification of 
gene products altered in their expression by the disease and represent those products 
most likely to be targets of therapeutic intervention. Similarly, the product may be of 
the infecting organism itself and also be an effective target of intervention. 

I. Definitions. 

Several words and phrases used throughout this specification are 
defined as follows: 

As used herein, the term "gene" refers to the genomic nucleotide 
sequence from which a cDNA sequence is derived, which cDNA produces an EST, as 
described below. The term gene classically refers to the genomic sequence, which, 
upon processing, can produce different cDNAs, e.g., by splicing events. However, 
for ease of reading, any full-length counterpart cDNA sequence which gives rise to an 
EST will also be referred to by shorthand herein as a 'gene'. 

The term "organism" includes, without limitation, microbes, plants and 
animals. 

The term "animal" is used in its broadest sense to include all members 
of the animal kingdom, including humans. It should be understood, however, that 
according to this invention the same species of animal which provides the biological 
sample also is the source of the defined immobilized oligonucleotide/polynucleotides 
as defined below. 

The term "pathogen" is defined herein as any molecule or organism 
which is capable of infecting an animal or plant and replicating its nucleic acid 
sequences in the cells or tissues of that animal or plant. Such a pathogen is generally 
associated with a disease condition in the infected animal or plant. Such pathogens 
may include viruses, which replicate intra- or extra-cellularly, or other organisms, 
such as bacteria, fungi or parasites, which generally infect tissues or the blood. 
Certain pathogens or microorganisms are known to exist in sequential and 
distinguishable stages of development, e.g., latent stages, infective stages, and stages 
which cause symptomatic diseases. In these different stages, the pathogens are 
anticipated to express differentially certain genes and/or turn on or off host cell gene 
expression. 

As used herein, the term "disease" or "disease state" refers to any 
condition which deviates from a normal or standardized healthy state in an organism 
of the same species in terms of differential expression of the organism's genes. In 
other words, a disease state can be any illness or disorder, be it of genetic or 
environmental origin, for example, an inherited disorder such as certain breast 
cancers, or a disorder which is characterized by expression of gene(s) normally in an 
inactive, turned off state in a healthy animal, or a disorder which is characterized by 
under-expression or no expression of gene(s) which is normally activated or 'turned 
on' in a normal healthy animal. Such differential expression of genes may also be 
detected in a condition caused by infection, inflammation, or allergy, a condition 
caused by development or aging of the animal, a condition caused by administration 
of a drug or exposure of the animal to another agent, e.g., nutrition, which affects 
gene expression. Essentially, the methods described herein can be adapted to detect 
differential gene expression resulting from any cause, by manipulation of the defined 
oligonucleotide/polynucleotides and the samples tested as described below. The 
concept of disease or disease state also includes its temporal aspects in terms of 
progression and treatment. 

The phrase "differentially expressed" refers to those situations in 
which a gene transcript is found in differing numbers of copies, or in activated vs. 
inactivated states, in different cell types or tissue types of an organism having a 
selected disease as contrasted to the levels of the gene transcript found in the same 
cells or tissues of a healthy organism. Genes may be differentially expressed in 
differing states of activation in microorganisms or pathogens in different stages of 
development. For example, multiple copies of gene transcripts may be found in an 
organism having a selected disease, while only one, or significantly fewer copies, of 
the same gene transcript are found in a healthy organism, or vice-versa. 
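
The copy-number comparison in this definition can be illustrated in code. This is a minimal sketch and not part of the specification: the EST names, the transcript counts, the two-fold threshold `min_fold`, and the `pseudocount` used to avoid division by zero are all assumptions chosen for illustration.

```python
def differentially_expressed(healthy, diseased, min_fold=2.0, pseudocount=1.0):
    """Return {gene: fold change} for genes whose diseased/healthy ratio
    is at least min_fold up or down (both thresholds are assumptions)."""
    flagged = {}
    for gene in sorted(set(healthy) | set(diseased)):
        h = healthy.get(gene, 0) + pseudocount
        d = diseased.get(gene, 0) + pseudocount
        fold = d / h
        if fold >= min_fold or fold <= 1.0 / min_fold:
            flagged[gene] = fold
    return flagged

# Hypothetical transcript copy numbers in the same tissue of two organisms.
healthy = {"EST_A": 1, "EST_B": 50, "EST_C": 10}
diseased = {"EST_A": 40, "EST_B": 48, "EST_C": 0}

flagged = differentially_expressed(healthy, diseased)
print(flagged)
```

Here `EST_A` is flagged as over-expressed and `EST_C` as switched off in the diseased sample, while `EST_B` is essentially unchanged and is not reported.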

As used herein, the term "solid support" refers to any known substrate 
which is useful for the immobilization of large numbers of 
oligonucleotide/polynucleotide sequences by any available method to enable 
detectable hybridization of the immobilized oligonucleotide/polynucleotide sequences 
with other polynucleotide sequences in a sample. Among a number of available solid 
supports, one desirable example is the supports described in International Patent 
Application No. WO91/07087, published May 30, 1991. Also useful are supports such 
as, but not limited to, nitrocellulose, nylon, glass, silica and Pall Biodyne C®. It is 
also anticipated that improvements yet to be made to conventional solid supports may 
also be employed in this invention. 

The term "surface" means any generally two-dimensional structure on 
a solid support to which the desired oligonucleotide/polynucleotide sequence is 
attached or immobilized. A surface may have steps, ridges, kinks, terraces and the 
like. 

As used herein, the term "predefined region" refers to a localized area 
on a surface of a solid support on which is immobilized one or multiple copies of a 
particular oligonucleotide/polynucleotide sequence and which enables the 
identification of the oligonucleotide/polynucleotide at the position, if hybridization of 
that oligonucleotide/polynucleotide to a sample polynucleotide occurs. 

The term "immobilized" refers to the attachment of the 
oligonucleotide/polynucleotide to the solid support. Means of immobilization are 
known and conventional to those of skill in the art, and may depend on the type of 
support being used. 

By "EST" or "Expressed Sequence Tag" is meant a partial DNA or 
cDNA sequence of about 150 to 500, more preferably about 300, sequential 
nucleotides of a longer sequence obtained from a genomic or cDNA library prepared 
from a selected cell, cell type, tissue or tissue type, organ or organism, which longer 
sequence corresponds to an mRNA of a gene found in that library. An EST is 
generally DNA. One or more libraries made from a single tissue type typically 
provide at least about 3000 different (i.e., unique) ESTs and potentially the full 
complement of all possible ESTs representing all cDNAs, e.g., 50,000-100,000 in an 
animal such as a human. Further background and information on the construction of 
ESTs is described in M. D. Adams et al, Science, 252:1651-1656 (1991); and 
International Application Number PCT/US92/05222 (January 7, 1993). 

As used herein, the term "defined oligonucleotide/polynucleotide 
sequence" refers to a known nucleotide sequence fragment of a selected EST or gene. 
This term is used interchangeably with the term "fragments of EST". These 
sequential sequences are generally comprised of between about 15 to about 45 
nucleotides and more preferably between about 20 to about 25 nucleotides in length. 
Thus any single EST of 300 nucleotides in length may provide about 280 different 
defined oligonucleotide/polynucleotide sequences of 20 nucleotides in length (e.g., 
20-mers). The lengths of the defined oligonucleotide/polynucleotides may be readily 
increased or decreased as desired or needed, depending on the limitations of the solid 
support on which they may be immobilized or the requirements of the hybridization 
conditions to be employed. The length is generally guided by the principle that it 
should be of sufficient length to ensure that it is on average represented only once in 
the population to be examined. Generally, these defined 
oligonucleotide/polynucleotides are RNA or DNA and are preferably derived from 
the anti-sense strand of the EST sequence or from a corresponding mRNA sequence 
to enable their hybridization with samples of RNA or DNA. Modified nucleotides 
may be incorporated to increase stability and hybridization properties. 
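
The counting in this definition can be made concrete. The sketch below is illustrative only: the 300-nt `toy_est` is a made-up repeat, not a real EST, the function names are hypothetical, and `min_unique_length` rests on the simplifying assumption that the population behaves like random sequence with equal base frequencies.

```python
def defined_oligos(est, k=20):
    """All k-nucleotide windows of an EST, one per start position."""
    return [est[i:i + k] for i in range(len(est) - k + 1)]

def antisense(seq):
    """Reverse complement of a DNA sequence (the anti-sense strand)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def min_unique_length(population_nt):
    """Smallest k with 4**k >= population size, i.e. a random k-mer is
    expected at most once (random-sequence assumption)."""
    k = 1
    while 4 ** k < population_nt:
        k += 1
    return k

toy_est = ("ACGTTGCA" * 38)[:300]        # hypothetical 300-nt EST
windows = defined_oligos(toy_est, k=20)  # sliding 20-mers
print(len(windows))                      # 300 - 20 + 1 windows
print(antisense("AAGC"))
print(min_unique_length(3_000_000_000))  # population size is an assumed figure
```

A 300-nucleotide EST yields 300 - 20 + 1 = 281 overlapping 20-mers, consistent with the "about 280" stated above. Under the random-sequence assumption, the preferred 20 to 25 nucleotide range sits comfortably above the minimum length computed for a population of a few billion nucleotides.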

By the term "plurality of defined oligonucleotide/polynucleotide 
sequences" is meant the following. A surface of a solid support may immobilize a 
large number of "defined oligonucleotide/polynucleotides". For example, depending 
upon the nature of the surface, it can immobilize from about 300 to upwards of 
60,000 defined 20-mer oligonucleotide/polynucleotides. It is anticipated that future 
improvements to solid surfaces will permit considerably larger such pluralities to be 
immobilized on a single surface. A "plurality" of sequences refers to the use on any 
one solid support of multiple different defined oligonucleotide/polynucleotides from a 
single EST from a selected library, as well as multiple different defined 
oligonucleotide/polynucleotides from different ESTs from the same library or many 
libraries from the same or different tissues, and may also include multiple identical 
copies of defined oligonucleotide/polynucleotides. Ultimately a plurality has at least 
one oligonucleotide/polynucleotide per expressed gene in the entire organism. For 
example, from a library producing about 5,000-10,000 ESTs, a single support can 
include at least about 1-20 defined oligonucleotide/polynucleotides representing every 
EST in that library. The composition of defined oligonucleotide/polynucleotides 
which make up a surface according to this invention may be selected or designed as 
desired. 

The term "sample" is employed in the description of this invention in 
several important ways. As used herein, the term "sample" encompasses any cell or 
tissue from an organism. Any desired cell or tissue type in any desired state may be 
selected to form a sample. For example, the sample cell desired may be a human T 
cell; the desired cell type for use in this invention may be a quiescent T cell or an 
activated T cell. 

By the phrase "analogous sample" or "analogous cell or tissue" is 
meant that according to this invention when the ESTs which provide the defined 
oligonucleotide/polynucleotides are produced from a cDNA library prepared from a 
single tissue or cell type source sample, e.g., liver tissue of a human, then the samples 
used to hybridize to those immobilized defined oligonucleotide/polynucleotides are 
preferably provided by the same type of sample from either a healthy or diseased 
animal, i.e., liver tissue of a healthy human and liver tissue of a diseased or infected 
human or from a human suspected of having that disease or infection. Alternatively, 
if the surface contains defined oligonucleotide/polynucleotides from multiple cells or 
tissues, then the "samples" which are hybridized thereto can be but are not limited to 
samples obtained from analogous multiple tissues or cells. 

The term "detectably hybridizing" means that the sample from the 
healthy organism or diseased or infected organism is contacted with the defined 
oligonucleotide/polynucleotides on the surface for sufficient time to permit the 
formation of patterns of hybridization on the surfaces caused by hybridization 
between certain polynucleotide sequences in the samples with certain immobilized 
defined oligonucleotide/polynucleotides. These patterns are made detectable by the 
use of available conventional techniques, such as fluorescent labelling of the samples. 
Preferably hybridization takes place under stringent conditions, e.g., revealing 
homologies of about 95%. However, if desired, other less stringent conditions may 
be selected. Techniques and conditions for hybridization at selected stringencies are 
well known in the art [see, e.g., Sambrook et al, Molecular Cloning. A Laboratory 
Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY (1989)]. 

II. Compositions of the Invention 

The present invention is based upon the use of ESTs from any desired 
cell or tissue in known technologies for oligonucleotide/polynucleotide hybridization. 

A. ESTs 

An EST, as defined above, is, for an animal, a sequence from a 
cDNA clone that corresponds to an mRNA. The EST sequences useful in the present 
invention are isolated preferably from cDNA libraries using a rapid screening and 
sequencing technique. Custom made cDNA libraries are made using known 
techniques. See, generally, Sambrook et al, cited above. Briefly, mRNA from a 
selected cell or tissue is reverse transcribed into complementary DNA (cDNA) using 
the reverse transcriptase enzyme and made double-stranded using RNase H coupled 
with DNA polymerase or reverse transcriptase. Restriction enzyme sites are added to 
the cDNA and it is cloned into a vector. The result is a cDNA library. Alternatively, 
commercially available cDNA libraries may be used. Libraries of cDNA can also be 
generated from recombinant expression of genomic DNA using known techniques, 
including polymerase chain reaction-derived techniques. 

ESTs (which can range from about 150 to about 500 nucleotides in 
length, preferably about 300 nucleotides) can be obtained through sequence analysis 
from either end of the cDNA insert. Desirably, the DNA libraries used to obtain 
ESTs use directional cloning methods so that either the 5' end of the cDNA (likely to 
contain coding sequence) or the 3' end (likely to be a non-coding sequence) can be 
selectively obtained. 

In general, the method for obtaining ESTs comprises applying
conventional automated DNA sequencing technology to screen clones,
advantageously randomly selected clones, from a cDNA library. The cDNA libraries
from the desired tissue can be preprocessed, or edited, by conventional techniques to
reduce repeated sequencing of high and intermediate abundance clones and to
maximize the chances of finding rare messages from specific cell populations.
Preferably, preprocessing includes the use of defined composition prescreening 
probes, e.g., cDNA corresponding to mitochondria, abundant sequences, ribosomes, 
actins, myelin basic polypeptides, or any other known high abundance peptide. These 
prescreening probes used for preprocessing are generally derived from known ESTs.
Other useful preprocessing techniques include subtraction hybridization, which
preferentially reduces the population of highly represented sequences in the library
[e.g., see Fargnoli et al, Anal. Biochem., 187:364 (1990)] and normalization, which
results in all sequences being represented in approximately equal proportions in the
library [Patanjali et al, Proc. Natl. Acad. Sci. USA, 88:1943 (1991)]. Additional
prescreening/differential screening approaches are known to those skilled in the art.

ESTs can then be generated from partial DNA sequencing of the 
selected clones. The ESTs useful in the present invention are preferably generated 
using low redundancy of sequencing, typically a single sequencing reaction. While
single sequencing reactions may have an accuracy as low as 90%, this nevertheless
provides sufficient fidelity for identification of the sequence and design of PCR 
primers. 

If desired, the location of an EST in a full length cDNA is determined
by analyzing the EST for the presence of coding sequence. A conventional computer
program is used to predict the extent and orientation of the coding region of a
sequence (using all six reading frames). Based on this information, it is possible to
infer the presence of start or stop codons within a sequence and whether the sequence
is completely coding or completely non-coding or a combination of the two. If start
or stop codons are present, then the EST can cover part of the 5'-untranslated or
3'-untranslated part of the mRNA (respectively) as well as part of the coding
sequence. If no coding sequence is present, it is likely that the EST is derived from
the 3' untranslated sequence due to its longer length and the fact that most cDNA
library construction methods are biased toward the 3' end of the mRNA. It should be
understood that both coding and non-coding regions may provide ESTs equally useful
in the described invention.
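
By way of illustration only, the six-reading-frame analysis described above may be
sketched as follows. This is a minimal sketch and not the conventional program
referred to in the text: the function names, the use of ATG as the sole start codon,
and the "longest stretch without a stop codon" measure are assumptions introduced
here for the example.

```python
# Illustrative sketch only: scan an EST in all six reading frames for
# start and stop codons. Names and scoring are assumptions, not part
# of the original disclosure.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}   # standard genetic code stop codons
START = "ATG"                   # assumed sole start codon


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def codons(seq, frame):
    """Split seq into complete codons, starting at offset frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def longest_open_stretch(codon_list):
    """Length, in codons, of the longest run containing no stop codon."""
    best = run = 0
    for codon in codon_list:
        run = 0 if codon in STOPS else run + 1
        best = max(best, run)
    return best


def six_frame_summary(est):
    """Report start/stop codon presence for the three frames of each strand."""
    est = est.upper()
    summary = []
    for strand, seq in (("+", est), ("-", reverse_complement(est))):
        for frame in range(3):
            frame_codons = codons(seq, frame)
            summary.append({
                "strand": strand,
                "frame": frame,
                "has_start": START in frame_codons,
                "has_stop": any(c in STOPS for c in frame_codons),
                "longest_open": longest_open_stretch(frame_codons),
            })
    return summary
```

A frame showing a start or stop codon flanked by a long open stretch suggests that
the EST spans a boundary between untranslated and coding sequence, whereas frames
interrupted by frequent stop codons on both strands suggest a non-coding EST.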

A number of specific ESTs suitable for use in the present
invention are described in Adams et al (supra), which may be incorporated by
reference herein, to describe non-essential examples of desirable ESTs. Other ESTs
exist in the art which may also be useful in this invention, as will ESTs yet to be
developed by these known techniques.

B. Preparing the Solid Support of the Invention 

Oligonucleotide sequences which are fragments of defined
sequence are derived from each EST by conventional means, e.g., conventional
chemical synthesis or recombinant techniques. Each defined
oligonucleotide/polynucleotide sequence as described above is a fragment, which can
be, but is not necessarily, an anti-sense fragment, of an EST isolated from a DNA library
prepared from a selected cell or tissue type from a selected animal. For use in the
present invention, it is presently preferred that the defined
oligonucleotide/polynucleotide sequences are 20-25mers. As described above, for
each EST a number of such 20-25mers may be generated. The lengths may vary as
described above, as may the composition. For example,
oligonucleotide/polynucleotides can be modified based on the Oligo 4.0 or similar
programs to predict hybridization potential or to include modified nucleotides for the
reasons given above. It is also appreciated that large DNA segments may be
employed, including entire ESTs or even full length genes, particularly when inserted
into cloning vectors.
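
By way of illustration only, the generation of a number of 20-25mers from a single
EST may be sketched as follows. The tiling scheme, the function names and the
default step are assumptions made for this example; the text does not prescribe
how the fragments are chosen, and a program such as Oligo 4.0 would apply
additional hybridization criteria not modelled here.

```python
# Illustrative sketch only: tile an EST into fixed-length oligonucleotides
# and, optionally, take the anti-sense form of each. The tiling scheme is
# an assumption, not a method stated in the original text.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def tile_oligos(est, length=25, step=25):
    """Return (offset, oligo) pairs of the given length tiled along the EST."""
    if not 20 <= length <= 25:
        raise ValueError("this sketch covers the preferred 20-25mers only")
    if step < 1:
        raise ValueError("step must be a positive integer")
    est = est.upper()
    return [(i, est[i:i + length])
            for i in range(0, len(est) - length + 1, step)]


def antisense(oligo):
    """Return the anti-sense (reverse complement) form of an oligo."""
    return oligo.upper().translate(COMPLEMENT)[::-1]
```

For an EST of about 300 nucleotides, `tile_oligos(est, 25, 25)` yields twelve
non-overlapping 25mers, while a smaller `step` yields overlapping fragments.
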

A plurality of these defined oligonucleotide/polynucleotide
sequences are then attached to a selected solid support conventionally used for the
attachment of nucleotide sequences, again by known means. In contrast to other
technologies available in the art, this support is designed to contain defined, not
random, oligonucleotide/polynucleotide sequences. The EST fragments, or defined
oligonucleotide/polynucleotide sequences, immobilized on the solid support can
include fragments of one or more ESTs from a library of at least one selected tissue
or cell sample of a healthy animal, at least one analogous sample of the animal having
a disease, at least one analogous sample of the animal infected with a pathogen, and
any combination thereof.

Numerous conventional methods are employed for attaching
biological molecules such as oligonucleotide/polynucleotide sequences to surfaces of
a variety of solid supports. See, e.g., Affinity Techniques, Enzyme Purification: Part
B, Methods in Enzymology, Vol. 34, ed. W.B. Jakoby, M. Wilchek, Acad. Press,
NY (1974); Immobilized Biochemicals and Affinity Chromatography, Advances in
Experimental Medicine and Biology, vol. 42, ed. R. Dunlap, Plenum Press, NY
(1974); U. S. Patent No. 4,762,881; U. S. Patent No. 4,542,102; European Patent
Publication No. 391,608 (October 10, 1990); U. S. Patent No. 4,992,127 (Nov. 21,
1989).

One desirable method for attaching
oligonucleotide/polynucleotide sequences derived from ESTs to a solid support is
described in International Application No. PCT/US90/06607 (published May 30,
1991). Briefly, this method involves forming predefined regions on a surface of a
solid support, where the predefined regions are capable of immobilizing ESTs. The
methods make use of binding substances attached to the surface which enable
selective activation of the predefined regions. Upon activation, these binding
substances become capable of binding and immobilizing
oligonucleotide/polynucleotides based on EST or longer gene sequences.

Any of the known solid substrates suitable for binding
oligonucleotide/polynucleotides at pre-defined regions on the surface thereof for
hybridization and methods for attaching the oligonucleotide/polynucleotides thereto
may be employed by one of skill in the art according to this invention. Similarly,
known conventional methods for making hybridization of the immobilized
oligonucleotide/polynucleotides detectable, e.g., fluorescence, radioactivity,
photoactivation, biotinylation, solid state circuitry, and the like may be used in this
invention.

Thus, by resorting to known techniques, the invention provides
a composition suitable for use in hybridization which consists of a surface of a solid
support on which is immobilized at pre-defined regions on said surface a plurality of
defined oligonucleotide/polynucleotide sequences for hybridization. For example,
one composition of this invention is a solid support on which are immobilized oligos
of EST fragments from a library constructed from a single cell type, e.g., a human
stem cell, or a single tissue, e.g., human liver, from a healthy human. Still another
composition of this invention is another solid support on which are immobilized
oligos of EST fragments from a library constructed from a single cell type or a tissue
from a human having a selected disease or predisposition to a selected disease, e.g.,
liver cancer.

Another embodiment of the compositions of this invention
includes a single solid support having oligonucleotides of ESTs from both single cell
or single tissue libraries from both a healthy and diseased human. Still other
embodiments include a single support on which are immobilized oligos of EST
fragments from more than one tissue or cell library from a healthy human or a single
support on which are immobilized more than one tissue or cell library from both
healthy and diseased animals or humans. A preferred composition of this invention is
anticipated to be a single support containing oligos of ESTs for all known cells and
tissues from a selected organism.

III. The Methods of the Invention

A. Identification of Genes

The present invention employs the compositions described
above in methods for identifying genes which are differentially expressed in a normal
healthy organism and an organism having a disease or infection. These methods may
be employed to detect such genes, regardless of the state of knowledge about the
function of the gene. The method of this invention, by use of the compositions
containing multiple defined EST fragments from a single gene as described above, is
able to detect levels of expression of genes, or in other cases simply the expression or
lack thereof, which differ between normal, healthy organisms and organisms having a
selected disease, disorder or infection.

One such method employs a first surface of a solid support on
which is immobilized at pre-defined regions thereon a plurality of defined
oligonucleotide/polynucleotide sequences, described above, of EST or longer gene
fragment isolated from a cDNA library prepared from at least one selected tissue or
cell sample of a healthy animal (the "healthy test surface") and a second such surface
on which is immobilized at pre-defined regions a plurality of defined
oligonucleotide/polynucleotide sequences of EST or longer gene fragment isolated
from at least one analogous tissue of an animal having a selected disease (the "disease
test surface"). These test surfaces may be standardized for the selected animal or
selected cell or tissue sample from that animal (i.e., they are prescreened for
polymorphisms in the species population).

Polynucleotide sequences are then isolated from mRNA and/or
cDNA from a biological sample from a known healthy animal ("healthy control") and
a second sample is similarly prepared from a sample from a known diseased animal 
("disease sample"). These two samples are desirably selected from the cell or tissue 
analogous to that which provided the immobilized oligonucleotide/polynucleotides. 

According to the method, the healthy control sample is
contacted with one set of the healthy test surface and the disease test surface
described above for a time sufficient to permit detectable hybridization to occur
between the sample and the immobilized defined oligonucleotide/polynucleotides on
each surface. The results of this hybridization are a first hybridization pattern formed
between the nucleotides of healthy control and the healthy test surface and a second
hybridization pattern formed between the nucleotides of healthy control sample and
the disease test surface.

In a similar manner, the disease sample is detectably hybridized
to another set of healthy test and disease test surfaces, forming a third hybridization
pattern between the disease sample and healthy test surface and a fourth hybridization
pattern between the disease sample and the disease test surface.

Comparing the four hybridization patterns permits detection of
those defined oligonucleotide/polynucleotides which are differentially expressed
between the healthy control and the disease sample by the presence of differences in
the hybridization patterns at pre-defined regions. The
oligonucleotide/polynucleotides on each surface which correspond to the pattern
differences may be readily identified with the corresponding EST or longer gene
fragment from which the oligonucleotide/polynucleotides are obtained.
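
By way of illustration only, the comparison of the four hybridization patterns may
be sketched as follows. The representation of a pattern as a mapping from
pre-defined region to signal intensity, the two-fold threshold, and the background
floor are all assumptions introduced for this example; the text does not specify
how a "difference" is quantified.

```python
# Illustrative sketch only: flag pre-defined regions whose signal differs
# between the healthy control and the disease sample on the same surface.
# Thresholds and data layout are assumptions, not part of the disclosure.

def differing_regions(pattern_a, pattern_b, min_fold=2.0, floor=1.0):
    """Return {region: fold change b/a} for regions differing by min_fold."""
    hits = {}
    for region in pattern_a.keys() & pattern_b.keys():
        a = max(pattern_a[region], floor)   # floor avoids division by zero
        b = max(pattern_b[region], floor)
        fold = b / a
        if fold >= min_fold or fold <= 1.0 / min_fold:
            hits[region] = fold
    return hits


def compare_four(p1, p2, p3, p4, **kwargs):
    """Compare patterns 1 vs 3 (healthy surface) and 2 vs 4 (disease surface)."""
    return {
        "healthy_surface": differing_regions(p1, p3, **kwargs),
        "disease_surface": differing_regions(p2, p4, **kwargs),
    }


# Hypothetical intensities keyed by region label.
first = {"A1": 100, "A2": 50}    # healthy control on healthy test surface
second = {"B1": 10, "B2": 80}    # healthy control on disease test surface
third = {"A1": 105, "A2": 400}   # disease sample on healthy test surface
fourth = {"B1": 90, "B2": 75}    # disease sample on disease test surface
result = compare_four(first, second, third, fourth)
```

Each flagged region label would then be looked up against the record of which EST
fragment was immobilized at that region.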

In another embodiment of the method of this invention, the
same process is employed, with the exception that the plurality of defined
oligonucleotide/polynucleotide sequences forming the healthy test sample and the
disease test sample surfaces are immobilized on a single solid support. For example,
each fragment of an EST or longer gene fragment on the surface is isolated from at
least two cDNA libraries prepared from a selected cell or tissue sample of a healthy
animal and an analogous selected cell or tissue sample of an animal having a disease.

According to this embodiment, the healthy control sample is
detectably hybridized to a copy of this single solid surface, forming one hybridization
pattern with oligonucleotide/polynucleotides associated with both the healthy and
diseased animal. Similarly, the disease sample is detectably hybridized to a second
copy of this single solid surface, forming one hybridization pattern with
oligonucleotide/polynucleotides associated with both the healthy and diseased animal.

WO 95/21944 PCT/US95/01863

Comparing the two hybridization patterns permits detection of
those defined oligonucleotide/polynucleotides which are differentially expressed
between the healthy control and the disease sample by the presence of differences in
the hybridization patterns at pre-defined regions. The
oligonucleotide/polynucleotides on each surface which correspond to the pattern
differences may be readily identified with the corresponding EST or longer gene
fragment from which the oligonucleotide/polynucleotides are obtained.

The identification of one or more ESTs as the source of the
defined oligonucleotide/polynucleotide which produced a "difference" in
hybridization patterns according to these methods permits ready identification of the
gene from which those ESTs were derived. Because oligonucleotides are of sufficient
length that they will hybridize under stringent conditions only with an RNA/cDNA for
that gene to which they correspond, the oligo can be used to identify the EST and in
turn the clone from which it was derived and, by subsequent cloning, obtain the
sequence of the full-length cDNA and its genomic counterparts, i.e., the gene, from
which it was obtained.

In other words, the ESTs identified by the method of this
invention can be employed to determine the complete sequence of the mRNA, in the
form of transcribed cDNA, by using the EST as a probe to identify a cDNA clone
corresponding to a full-length transcript, followed by sequencing of that clone. The
EST or the full length cDNA clone can also be used as a probe to identify a genomic
clone or clones that contain the complete gene including regulatory and promoter
regions, exons, and introns.

It should be appreciated that one does not have to be restricted
to using ESTs from the particular tissue from which probe RNA or cDNA is obtained;
rather, any or all ESTs (known or unknown) may be placed on the support.
Hybridization will be used to form diagnostic patterns or to identify which particular
EST is detected. For example, all known ESTs from an organism are used to produce
a "master" solid support to which control sample and disease samples are alternately
hybridized. One then detects a pattern of hybridization associated with the particular
disease state, which then forms the basis of a diagnostic test or the isolation of
disease specific ESTs from which the intact gene may be cloned and sequenced,
leading ultimately to a defined therapeutic target.

Methods for obtaining complete gene sequences from ESTs are
well-known to those of skill in the art. See, generally, Sambrook et al, cited above.
Briefly, one suitable method involves purifying the DNA from the clone that was
sequenced to give the EST and labeling the isolated insert DNA. Suitable labeling
systems are well known to those of skill in the art [see, e.g., Basic Methods in
Molecular Biology, L. G. Davis et al, ed., Elsevier Press, NY (1986)]. The labeled
EST insert is then used as a probe to screen a lambda phage cDNA library or a
plasmid cDNA library, identifying colonies containing clones related to the probe
cDNA which can be purified by known methods. The ends of the newly purified
clones are then sequenced to identify full length sequences and complete sequencing
of full length clones is performed by enzymatic digestion or primer walking. A
similar screening and clone selection approach can be applied to clones from a
genomic DNA library.

Additionally, an EST or gene identified by this method as
associated with inherited disorders can be used to determine at what stage during
embryonic development the selected gene from which it is derived is developed by
screening embryonic DNA libraries from various stages of development, e.g., 2-cell,
8-cell, etc., for the selected gene. As has been mentioned above, the invention may
be applied in additional temporal modes for monitoring the progression of a disease
state, the efficacy of a particular treatment modality or the aging process of an
individual.

Thus, the methods of this invention permit the identification,
isolation and sequencing of a gene which is differentially expressed in a selected
disease/infection. As described in more detail below, the identified gene may then be
employed to obtain any protein encoded thereby, or may be employed as a target for
diagnostic methods or therapeutic approaches to the treatment of the disease,
including, e.g., drug development.

The same methods as described above for the identification of
genes, including genes of unknown function, which are differentially expressed in a
disease state, may also be employed to identify other genes of interest. For example,
another embodiment of this invention includes a method for identifying a gene of a
pathogen which is expressed in a biological sample of an animal infected with that
pathogen or the gene of the host which is altered in its expression as a result of the
infection.

One such method employs a healthy test surface as described
above, employing defined oligonucleotide/polynucleotides from a sample of a
healthy, uninfected animal. The second such surface has immobilized at pre-defined
regions thereon a plurality of defined oligonucleotide/polynucleotide sequences of
ESTs isolated from at least one analogous tissue or cell sample of an infected animal
(the "infection test surface"). Polynucleotide sequences are isolated from a biological
sample from a healthy animal ("healthy control") and a second sample is similarly
prepared from an animal infected with the selected pathogen ("infection sample").
These two samples are desirably selected from the cell or tissue analogous to that
which provided the immobilized oligonucleotide/polynucleotides. It would also be
possible to provide samples from the nucleic acid of the pathogen itself.

According to the method, the healthy control sample is
contacted with one set of the healthy test surface and the infection test surface
described above for a time sufficient to permit detectable hybridization to occur
between the sample and the immobilized defined oligonucleotide/polynucleotides on
each surface. The results of this hybridization are a first hybridization pattern formed
between the nucleotides of healthy control and the healthy test surface and a second
hybridization pattern formed between the nucleotides of healthy control sample and
the infection test surface.

In a similar manner, the infection sample is detectably
hybridized to another set of healthy test and infection test surfaces, forming a third
hybridization pattern between the infection sample and healthy test surface and a
fourth hybridization pattern between the infection sample and the infection test
surface.

Comparing the four hybridization patterns permits detection of
those defined oligonucleotide/polynucleotides which are differentially expressed
between the healthy animal and the animal infected with the pathogen by the presence
of differences in the hybridization patterns at pre-defined regions. As mentioned,
differential expression is not required, and simple qualitative analysis is possible by
reference to gene expression which is simply present or absent.

A second embodiment of this method parallels the second
embodiment of the method as applied to disease above, i.e., the same process is
employed, with the exception that the plurality of defined oligonucleotide/polynucleotide
sequences forming the healthy test sample surface and the infection test sample
surface are immobilized on a single solid support. The resulting first hybridization
pattern (healthy control sample with healthy/infection test sample) and second
hybridization pattern (infection sample with healthy/infection test sample) permit
detection of those defined oligonucleotide/polynucleotides which are differentially
expressed between the healthy control and the infection sample by the presence of
differences in the hybridization patterns at pre-defined regions. The
oligonucleotide/polynucleotides on each surface which correspond to the pattern
differences may be readily identified with the corresponding ESTs from which the
oligonucleotide/polynucleotides are obtained.

As described above for the methods for identifying differential
gene expression between diseased and healthy animals, the
oligonucleotide/polynucleotides on each surface which correspond to the pattern
differences may be readily identified with the corresponding ESTs from which the
oligonucleotide/polynucleotide sequences are obtained and the genes expressed by the
pathogen identified for similar purposes. Other embodiments of these methods may
be developed with resort to the teaching herein, by altering the samples which provide
the defined oligonucleotide/polynucleotides. For example, an EST identified with a
differentially expressed gene by the method of this invention is also useful in
detecting genes expressed in the various stages of a pathogen's development,
particularly the infective stage, and in following the course of drug treatment and
emergence of resistant variants. For example, employing the techniques described
above, the EST can be used for detecting a gene in various stages of the parasitic
Plasmodium species life cycle, which include blood stages, liver stages, and
gametocyte stages.

B. Diagnostic Methods 

In addition to use of the methods and compositions of this
invention for identifying differentially expressed genes, another embodiment of this
invention provides diagnostic methods for diagnosing a selected disease state, or a
selected state resulting from aging, exposure to drugs or infection in an animal.
According to this aspect of the invention, a first surface, described as the healthy test
surface above, and a second surface, described as the disease test surface or infection
test surface, are prepared depending on the disease or infection to be diagnosed. The
same processes of detectable hybridization to a first and second set of these surfaces
with the healthy control sample and disease/infection sample are followed to provide
the four above-described hybridization patterns, i.e., healthy control sample with
healthy test surface; healthy control sample with disease/infection test surface;
disease/infection sample with healthy test surface; and disease/infection sample with
disease/infection test surface.

The diagnosis of disease or infection is provided by comparing
the four hybridization patterns. Substantial differences between the first and third
hybridization patterns, respectively, and the second and fourth hybridization patterns,
respectively, indicate the presence of the selected disease or infection in said animal.
Substantial similarities in the first and third hybridization patterns and second and
fourth hybridization patterns indicate the absence of disease or infection.
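
By way of illustration only, the diagnostic comparison may be sketched as a simple
decision rule. The text does not define "substantial"; the fraction of differing
regions, the two-fold criterion, the 25% cut-off and the "indeterminate" outcome for
mixed results are all assumptions introduced for this example.

```python
# Illustrative sketch only: a decision rule over the four hybridization
# patterns. All thresholds are assumptions, not values from the disclosure.

def fraction_differing(pattern_a, pattern_b, min_fold=2.0, floor=1.0):
    """Fraction of shared regions whose signals differ by at least min_fold."""
    shared = pattern_a.keys() & pattern_b.keys()
    if not shared:
        raise ValueError("patterns share no pre-defined regions")
    changed = 0
    for region in shared:
        a = max(pattern_a[region], floor)   # floor avoids division by zero
        b = max(pattern_b[region], floor)
        if max(a, b) / min(a, b) >= min_fold:
            changed += 1
    return changed / len(shared)


def diagnose(p1, p2, p3, p4, substantial=0.25):
    """Compare pattern 1 with 3 and pattern 2 with 4; return a verdict string."""
    differs_13 = fraction_differing(p1, p3) >= substantial
    differs_24 = fraction_differing(p2, p4) >= substantial
    if differs_13 and differs_24:
        return "present"
    if not differs_13 and not differs_24:
        return "absent"
    return "indeterminate"
```

In practice the cut-off would have to be calibrated against samples of known
status; the values above merely make the rule concrete.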

A similar embodiment utilizes the single surface bearing both
the healthy test surface defined oligonucleotide/polynucleotides and the
disease/infection test surface defined oligonucleotide/polynucleotides as described
above. Parallel process steps as described above for detection of genes differentially
expressed in disease and infected states are followed, resulting in a first hybridization
pattern (healthy control sample with single healthy and disease/infection test sample)
and a second hybridization pattern (disease/infection sample with another copy of the
single healthy and disease/infection test sample).

Diagnosis is accomplished by comparing the two hybridization
patterns, wherein substantial differences between the first and second hybridization
patterns indicate the presence of the selected disease or infection in the animal being
tested. Substantially similar first and second hybridization patterns indicate the
absence of disease or infection. This, like many of the foregoing embodiments, may
use known or unknown ESTs derived from many libraries.

C. Other Methods of the Invention

As is obvious to one of skill in the art upon reading this
disclosure, the compositions and methods of this invention may also be used for other
similar purposes. For example, the general methods and compositions may be
adapted easily by manipulation of the samples selected to provide the standardized
defined oligonucleotide/polynucleotides, and selection of the samples selected for
hybridization thereto. One such modification is the use of this invention to identify
cell markers of any type, e.g., markers of cancer cells, stem cell markers, and the like.
Another modification involves the use of the method and compositions to generate
hybridization patterns useful for forensic identification or an 'expression fingerprint'
of genes for identification of one member of a species from another. Similarly, the
methods of this invention may be adapted for use in tissue matching for
transplantation purposes as well as for molecular histology, i.e., to enable diagnosis of
disease or disorders in pathology tissue samples such as biopsies. Still another use of
this method is in monitoring the effects of development and aging upon the gene
expression in a selected animal, by preparing surfaces bearing
oligonucleotide/polynucleotides prepared from samples of standardized younger
members of the species being tested. Additionally, the patient can serve as an internal
control by virtue of having the method applied to blood samples every 5-10 years
during his lifetime.

Still another intriguing use of this method is in the area of
monitoring the effects of drugs on gene expression, both in laboratories and during
clinical trials with animals, especially humans. Because the method can be readily
adapted by altering the above parameters, it can essentially be employed to identify
differentially expressed genes of any organism, at any stage of development, and
under the influence of any factor which can affect gene expression.

IV. The Genes and Proteins Identified

Application of the compositions and methods of this invention as
above described also provides other compositions, such as any isolated gene sequence
which is differentially expressed between a normal healthy animal and an animal
having a disease or infection. Another embodiment of this invention is any isolated
pathogen gene sequence which is expressed in tissue or cell samples of an infected
animal. Similarly, an embodiment of this invention is any gene sequence identified by
the methods described herein.

These gene sequences may be employed in conventional methods to
produce isolated proteins encoded thereby. To produce a protein of this invention,
the DNA sequences of a desired gene identified by the use of the methods of this
invention or portions thereof are inserted into a suitable expression system.
Desirably, a recombinant molecule or vector is constructed in which the
polynucleotide sequence encoding the protein is operably linked to a heterologous
expression control sequence permitting expression of the human protein. Numerous
types of appropriate expression vectors and host cell systems are known in the art for
mammalian (including human) expression, insect, e.g., baculovirus expression, yeast,
fungal, and bacterial expression, by standard molecular biology techniques.

The transfection of these vectors into appropriate host cells, whether
mammalian, bacterial, fungal, or insect, or into appropriate viruses, can result in
expression of the selected proteins. Suitable host cells or cell lines for transfection,
and viruses, as well as methods for the construction and transfection of such host cells
and viruses are well-known. Suitable methods for transfection, culture, amplification,
screening, and product production and purification are also known in the art.

The genes and proteins identified by this invention can be employed, if
desired, in diagnostic compositions useful for the diagnosis of a disease or infection
using conventional diagnostic assays. For example, a diagnostic reagent can be
developed which detectably targets a gene sequence or protein of this invention in a
biological sample of an animal. Such a reagent may be a complementary nucleotide
sequence, an antibody (monoclonal, recombinant or polyclonal), or a chemically
derived agonist or antagonist. Alternatively, the proteins and polynucleotide
sequences of this invention, fragments of same, or complementary sequences thereto,
may themselves be useful as diagnostic reagents for diagnosing disease states with
which the ESTs of the invention are associated. These reagents may optionally be
labelled using diagnostic labels, such as radioactive labels, colorimetric enzyme label
systems and the like conventionally used in diagnostic or therapeutic methods, e.g.,
Northern and Western blotting, antigen-antibody binding and the like. The selection
of the appropriate assay format and label system is within the skill of the art and may
readily be chosen without requiring additional explanation by resort to the wealth of
art in the diagnostic area.

Additionally, genes and proteins identified according to this invention
may be used therapeutically. For example, the EST-containing gene sequences may
be useful in gene therapy, to provide a gene sequence which in a disease is not
properly or sufficiently expressed. In such a method, a selected gene sequence of this
invention is introduced into a suitable vector or other delivery system for delivery to a
cell containing a defect in the selected gene. Suitable delivery systems are well
known to those of skill in the art and enable the desired EST or gene to be
incorporated into the target cell and to be translated by the cell. The EST or gene
sequence may be introduced to mutate the existing gene by recombination or provide
an active copy thereof in addition to the inactive gene to replace its function.

Alternatively, a protein encoded by an EST or gene of the invention
may be useful as a therapeutic reagent for delivery of a biologically active protein,
particularly when the disease state is associated with a deficiency of this protein.
Such a protein may be incorporated into an appropriate therapeutic formulation, alone
or in combination with other active ingredients. Methods of formulating such
therapeutic compositions, as well as suitable pharmaceutical carriers, and the like, are
well known to those of skill in the art. Still an additional method of delivering the
missing protein encoded by an EST, or the gene from which a selected EST was
derived, involves expressing it directly in vivo. Systems for such in vivo expression
are well known in the art.

Yet another use of the ESTs, genes identified according to the methods
of this invention, or the proteins encoded thereby is as a target for the screening and
development of natural or synthetic chemical compounds which have utility as
therapeutic drugs for the treatment of disease states associated with the identified
genes and ESTs derived therefrom. As one example, a compound capable of binding
to such a protein encoded by such a gene and either preventing or enhancing its
biological activity may be a useful drug component for the treatment or prevention of
such disease states.

Conventional assays and techniques may be used for the screening and
development of such drugs. As one example, a method for identifying compounds
which specifically bind to or inhibit or activate proteins encoded by these gene
sequences can include simply the steps of contacting a selected protein or gene
product, with a test compound to permit binding of the test compound to the protein;
and determining the amount of test compound, if any, which is bound to the protein.
Such a method may involve the incubation of the test compound and the protein
immobilized on a solid support. Still other conventional methods of drug screening
can involve employing a suitable computer program to determine compounds having
similar or complementary chemical structures to that of the gene product or portions
thereof and screening those compounds either for competitive binding to the protein
to detect enhanced or decreased activity in the presence of the selected compound.

Thus, through use of such methods, the present invention is anticipated
to provide compounds capable of interacting with these genes, ESTs, or encoded
proteins, or fragments thereof, and either enhancing or decreasing the biological
activity, as desired. Such compounds are believed to be encompassed by this
invention.

Numerous modifications and variations of the present invention are
included in the above-identified specification and are expected to be obvious to one of
skill in the art. Such modifications and alterations to the compositions and processes
of the present invention are believed to be encompassed in the scope of the claims
appended hereto.

WHAT IS CLAIMED IS:

1. A method for identifying genes which are differentially expressed in
two different pre-determined states of an organism comprising:

a. providing a first surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST a fragment of a gene or an entire gene, isolated from a DNA library
prepared from at least one selected cell, tissue, organ or organism sample in a first
state and present in excess relative to the polynucleotide to be hybridized;

b. providing a second surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST a fragment of a gene or an entire gene, isolated from a DNA library
prepared from at least one selected cell, tissue, organ or organism sample in a second
state and present in excess relative to the polynucleotide to be hybridized;

c. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a sample from a said organism in said first
state, said sample selected from sources analogous to the sources of step (a), said
hybridization sufficient to form a first and second hybridization pattern on each said
first and second surface,

d. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a sample from said organism in said second
state, said sample selected from sources analogous to the sources of step (c), said
hybridization sufficient to form a third and fourth hybridization pattern on each said
first and second surface,

e. comparing at least two of the four hybridization patterns,
wherein genes differentially expressed in said first and second states are identified by
the presence of differences in the hybridization patterns at pre-defined regions;

f. identifying the oligonucleotide/polynucleotides on each surface
which correspond to said pattern differences and the corresponding ESTs or larger
gene fragment from which the oligonucleotide/polynucleotides were obtained,
whereby identification of the EST or larger gene fragment permits identification of
the gene from which the ESTs or larger gene fragment were derived.

2. The method according to Claim 1 wherein said first and second states are
respectively healthy and disease; pathogen uninfected and pathogen infected; a first
progression state and a second progression of a disease or infection; a first treatment
state and a second treatment state of a disease or infection; or a first developmental
and a second developmental state.

3. The method according to Claim 1 wherein said organism is a plant or an
animal.

4. The method according to Claim 3 wherein said animal is a human.

5. A method for identifying genes which are differentially expressed in a
normal healthy animal and an animal having a disease comprising:

a. providing a first surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a
fragment of an EST, an entire EST a fragment of a gene or an entire gene, isolated
from a DNA library prepared from at least one selected cell, tissue, organ or organism
sample in a healthy animal and present in excess relative to the polynucleotide to be
hybridized;

b. providing a second surface on which is immobilized at pre-defined
regions of said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a
fragment of an EST, an entire EST a fragment of a gene or an entire gene, isolated
from a DNA library prepared from at least one selected cell, tissue, organ or organism
sample from an animal having said disease and present in excess relative to the
polynucleotide to be hybridized;

c. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a sample from a healthy animal, said sample
selected from sources analogous to the sources of step (a), said hybridization
sufficient to form a first and second hybridization pattern on each said first and
second surface, said sample selected from a cell or tissue sample analogous to the
sample of step (a), said hybridization sufficient to form a first and second
hybridization pattern on each said first and second surface;

d. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a sample from an animal having said disease,
said sample selected from a cell or tissue sample analogous to the sample of step (c),
said hybridization sufficient to form a third and fourth hybridization pattern on each
said first and second surface,

e. comparing at least two of the four hybridization patterns,
wherein genes differentially expressed in said first and second states are identified by
the presence of differences in the hybridization patterns at pre-defined regions;

f. identifying the oligonucleotide/polynucleotides on each surface
which correspond to said pattern differences and the corresponding ESTs or larger
gene fragment from which the oligonucleotide/polynucleotides were obtained,
whereby identification of the EST or larger gene fragment permits identification of
the gene from which the ESTs or larger gene fragment were derived.

6. A method for identifying genes which are differentially expressed in a
normal healthy animal and an animal having a disease comprising:

a. providing a surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST a fragment of a gene or an entire gene isolated from a DNA library
prepared from the group selected from at least one selected cell, tissue, organ or
organism sample of a healthy animal and an analogous selected sample of an
animal having said disease and both present in excess relative to the polynucleotide to
be hybridized;

b. detectably hybridizing to a first copy of said surface
polynucleotide sequences isolated from a healthy animal, said sample selected from a
cell or tissue sample analogous to the sample of step (a), said hybridization sufficient
to form a first hybridization pattern on said surface;

c. detectably hybridizing to a second copy of said surface
polynucleotide sequences isolated from an animal having said disease, said sample
selected from a cell or tissue sample analogous to the sample of step (a), said
hybridization sufficient to form a second hybridization pattern on said surface;

d. comparing the two hybridization patterns, wherein genes
differentially expressed in a disease state are identified by the presence of differences
in the hybridization patterns at pre-defined regions;

e. identifying the oligonucleotide/polynucleotides on each surface
which correspond to said pattern differences and the corresponding ESTs from which
the oligonucleotide/polynucleotides are obtained, whereby identification of the EST
permits identification of the gene from which the ESTs were derived.

7. A method for identifying a gene of a pathogen which is expressed in a
biological sample of an animal infected with said pathogen comprising:

a. providing a first surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST a fragment of a gene or an entire gene isolated from a DNA library
prepared from at least one selected cell, tissue, organ or organism sample of a
healthy, uninfected animal and present in excess relative to the polynucleotide to be
hybridized;

b. providing a second surface on which is immobilized at pre-defined
regions of said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST a fragment of a gene or an entire gene isolated from at least one
selected cell, tissue, organ or organism sample of an infected animal;

c. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a sample from a healthy animal, said sample
selected from a cell or tissue sample analogous to the sample of step (a), said
hybridization sufficient to form first and second hybridization patterns on each said
first and second surface,

d. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a sample from an infected animal, said
sample selected from a cell or tissue sample analogous to the sample of step (a), said
hybridization sufficient to form third and fourth hybridization patterns on each said
first and second surface,

e. comparing the four hybridization patterns, wherein genes of
said pathogen which are expressed in an infected animal are identified by the
presence of differences in the hybridization patterns at pre-defined regions;

f. identifying the oligonucleotide/polynucleotides on each surface
which correspond to said pattern differences and the corresponding ESTs from which
the oligonucleotide/polynucleotides are obtained, whereby identification of the EST
permits identification of the gene from which the ESTs were derived.

8. A method for identifying a gene of a pathogen which is expressed in a
biological sample of an animal infected with said pathogen comprising:

a. providing a surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST a fragment of a gene or an entire gene isolated from a DNA library
prepared from the group selected from at least one selected cell, tissue, organ or
organism sample of a healthy animal and an analogous selected sample of an
animal having said disease and both present in excess relative to the polynucleotide to
be hybridized;

b. detectably hybridizing to a first copy of said surface
polynucleotide sequences isolated from a sample from a healthy animal, said sample
selected from a cell or tissue sample analogous to the sample of step (a), said
hybridization sufficient to form a first hybridization pattern on said surface;

c. detectably hybridizing to a second copy of said surface
polynucleotide sequences isolated from a sample from an infected animal, said
sample selected from a cell or tissue sample analogous to the sample of step (a), said
hybridization sufficient to form a second hybridization pattern on said surface;

d. comparing the two hybridization patterns, wherein genes of
said pathogen which are expressed in an infected animal are identified by the
presence of differences in the hybridization patterns at pre-defined regions;

e. identifying the oligonucleotide/polynucleotides on each surface
which correspond to said pattern differences and the corresponding ESTs from which
the oligonucleotide/polynucleotides are obtained, whereby identification of the EST
permits identification of the gene from which the ESTs were derived.

9. A composition suitable for use in hybridization comprising a solid
surface on which is immobilized at pre-defined regions on said surface a plurality of
defined oligonucleotide/polynucleotide sequences for hybridization, each sequence
selected from the group consisting of a fragment of an EST, an entire EST a fragment
of a gene or an entire gene isolated from a DNA library prepared from the group
selected from at least one selected cell, tissue, organ or organism sample of a healthy
animal, at least one analogous sample of said animal having a disease, at least one
analogous sample of said animal infected with a microbial pathogen, and any
combination thereof.

10. An isolated gene sequence which is differentially expressed in a
normal healthy animal and an animal having a disease, identified by the method of
claim 1.

11. An isolated pathogen gene sequence which is expressed in tissue or
cell samples of an infected animal identified by the method of claim 7.

12. A diagnostic composition useful for the diagnosis of a disease
comprising a reagent capable of detectably targeting a gene sequence of claim 10 in a
biological sample of an animal.

13. A diagnostic composition useful for the diagnosis of infection by a
pathogen comprising a reagent capable of detectably targeting a gene sequence of
claim 11 in a biological sample of an animal.

14. An isolated protein produced by expression of a gene sequence of
claim 10.

15. An isolated pathogen protein produced by expression of a gene
sequence of claim 11.

16. A therapeutic composition comprising a protein or fragment thereof
selected from the group consisting of a protein of claim 10 and a protein of claim 15.

17. A method for diagnosing a selected disease or infection in an animal
comprising:

a. providing a first surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence selected from the group consisting of a fragment of an EST,
an entire EST a fragment of a gene or an entire gene, isolated from a DNA library
prepared from at least one selected cell, tissue, organ or organism sample of a healthy
animal and present in excess relative to the polynucleotide to be hybridized;

b. providing a second surface on which is immobilized at pre-defined
regions of said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence comprising a fragment of an EST isolated from at least one
said tissue of an animal having said disease;

c. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a DNA library prepared from a sample from a
healthy animal, said sample selected from a cell or tissue sample analogous to the
sample of step (a), said hybridization sufficient to form a first and second
hybridization pattern on each said first and second surface;

d. detectably hybridizing to a set of said first and second surfaces
polynucleotide sequences isolated from a DNA library prepared from a sample from
an animal having said disease, said sample selected from a cell or tissue sample
analogous to the sample of step (c), said hybridization sufficient to form a third and
fourth hybridization pattern on each said first and second surface;

e. comparing the four hybridization patterns, wherein substantial
differences between the first and third hybridization patterns and the second and
fourth hybridization patterns indicates the presence of said selected disease or
infection in said animal, and substantial similarities in said first and third
hybridization patterns and second and fourth hybridization patterns indicates the
absence of disease or infection.

18. A method for diagnosing a selected disease or infection in an animal
comprising:

a. providing a surface on which is immobilized at pre-defined
regions on said surface a plurality of defined oligonucleotide/polynucleotide
sequences, each sequence comprising a fragment of an EST isolated from a DNA
library prepared from the group consisting of a selected cell or tissue sample of a
healthy animal and an analogous selected cell or tissue sample of an animal having
said disease;

b. detectably hybridizing to a first copy of said surface
polynucleotide sequences isolated from a sample from a healthy animal, said sample
selected from a cell or tissue sample analogous to the sample of step (a), said
hybridization sufficient to form a first hybridization pattern on said surface;

c. detectably hybridizing to a second copy of said surface
polynucleotide sequences isolated from a DNA library prepared from a sample from
an animal having said disease, said sample selected from a cell or tissue sample
analogous to the sample of step (a), said hybridization sufficient to form a second
hybridization pattern on said surface;

d. comparing the two hybridization patterns, wherein substantial
differences between the first and second hybridization patterns indicates the presence
of said selected disease or infection in said animal, and substantial similarities in said
first and second hybridization patterns indicates the absence of disease or infection.

COMPARATIVE GENE TRANSCRIPT ANALYSIS



1. FIELD OF INVENTION
The present invention is in the field of molecular
biology and computer science; more particularly, the
present invention describes methods of analyzing gene
transcripts and diagnosing the genetic expression of cells
and tissue.



2. BACKGROUND OF THE INVENTION
Until very recently, the history of molecular biology
has been written one gene at a time. Scientists have
observed the cell's physical changes, isolated mixtures
from the cell or its milieu, purified proteins, sequenced
proteins and therefrom constructed probes to look for the
corresponding gene.

Recently, different nations have set up massive
projects to sequence the billions of bases in the human
genome. These projects typically begin with dividing the
genome into large portions of chromosomes and then
determining the sequences of these pieces, which are then
analyzed for identity with known proteins or portions
thereof, known as motifs. Unfortunately, the majority of
genomic DNA does not encode proteins and though it is
postulated to have some effect on the cell's ability to
make protein, its relevance to medical applications is not
understood at this time.

A third methodology involves sequencing only the
transcripts encoding the cellular machinery actively
involved in making protein, namely the mRNA. The advantage
is that the cell has already edited out all the non-coding
DNA, and it is relatively easy to identify the protein-coding
portion of the RNA. The utility of this approach
was not immediately obvious to genomic researchers. In
fact, when cDNA sequencing was initially proposed, the
method was roundly denounced by those committed to genomic
sequencing. For example, the head of the U.S. Human Genome
project discounted cDNA sequencing as not valuable and
refused to approve funding of projects.

In this disclosure, we teach methods for analyzing
DNA, including cDNA libraries. Based on our analyses and
research, we see each individual gene product as a "pixel"
of information, which relates to the expression of that,
and only that, gene. We teach herein methods whereby the
individual "pixels" of gene expression information can be
combined into a single gene transcript "image," in which
each of the individual genes can be visualized
simultaneously, allowing relationships between the gene
pixels to be easily visualized and understood.

We further teach a new method which we call electronic
subtraction. Electronic subtraction will enable the gene
researcher to turn a single image into a moving picture,
one which describes the temporality or dynamics of gene
expression, at the level of a cell or a whole tissue. It
is that sense of "motion" of cellular machinery on the
scale of a cell or organ which constitutes the new
invention herein. This constitutes a new view into the
process of living cell physiology and one which holds great
promise to unveil and discover new therapeutic and
diagnostic approaches in medicine.

We teach another method which we call "electronic
northern," which tracks the expression of a single gene
across many types of cells and tissues.
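
By way of illustration only, the idea of following one gene across many cell and tissue types can be sketched in a few lines of Python. This is a minimal sketch and not the method of the disclosure: it assumes each library has already been reduced to a list of transcript identifiers, and the function name, gene names and tissue names are hypothetical.

```python
def electronic_northern(gene, libraries):
    """Relative abundance of one gene in each library.

    `libraries` maps a tissue name to the list of transcript
    identifiers observed in that tissue (hypothetical input format).
    """
    profile = {}
    for tissue, transcripts in libraries.items():
        if not transcripts:
            raise ValueError(f"library {tissue!r} is empty")
        # Fraction of all sequenced transcripts that correspond to `gene`.
        profile[tissue] = transcripts.count(gene) / len(transcripts)
    return profile


# Hypothetical example data: two tissues, four and two sequenced clones.
libraries = {
    "liver": ["geneA", "geneB", "geneA", "geneC"],
    "brain": ["geneB", "geneC"],
}
profile = electronic_northern("geneA", libraries)
```

Expressing the result as a fraction of each library, rather than a raw count, keeps libraries of different sizes comparable.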

Nucleic acids (DNA and RNA) carry within their
sequence the hereditary information and are therefore the
prime molecules of life. Nucleic acids are found in all
living organisms including bacteria, fungi, viruses, plants
and animals. It is of interest to determine the relative
abundance of different discrete nucleic acids in different
cells, tissues and organisms over time under various
conditions, treatments and regimes.

All dividing cells in the human body contain the same
set of 23 pairs of chromosomes. It is estimated that these
autosomal and sex chromosomes encode approximately 100,000
genes. The differences among different types of cells are
believed to reflect the differential expression of the
100,000 or so genes. Fundamental questions of biology
could be answered by understanding which genes are
transcribed and knowing the relative abundance of
transcripts in different cells.



WO 95/20681 PCT/US95/01160 

Previously, the art has only provided for the analysis
of a few known genes at a time by standard molecular
biology techniques such as PCR, northern blot analysis, or
other types of DNA probe analysis such as in situ
hybridization. Each of these methods allows one to analyze
the transcription of only known genes and/or small numbers
of genes at a time. Nucl. Acids Res. 19, 7097-7104 (1991);
Nucl. Acids Res. 18, 4833-42 (1990); Nucl. Acids Res. 18,
2789-92 (1989); European J. Neuroscience 2, 1063-1073
(1990); Analytical Biochem. 187, 364-73 (1990); Genet.
Annals Techn. Appl. 7, 64-70 (1990); GATA 8(4), 129-33
(1991); Proc. Natl. Acad. Sci. USA 85, 1696-1700 (1988);
Nucl. Acids Res. 19, 1954 (1991); Proc. Natl. Acad. Sci.
USA 88, 1943-47 (1991); Nucl. Acids Res. 19, 6123-27
(1991); Proc. Natl. Acad. Sci. USA 85, 5738-42 (1988);
Nucl. Acids Res. 16, 10937 (1988).

Studies of the number and types of genes whose
transcription is induced or otherwise regulated during cell
processes such as activation, differentiation, aging, viral
transformation, morphogenesis, and mitosis have been
pursued for many years, using a variety of methodologies.
One of the earliest methods was to isolate and analyze
levels of the proteins in a cell, tissue, organ system, or
even organisms both before and after the process of
interest. One method of analyzing multiple proteins in a
sample is using 2-dimensional gel electrophoresis, wherein
proteins can be, in principle, identified and quantified as
individual bands, and ultimately reduced to a discrete
signal. At present, 2-dimensional analysis only resolves
approximately 15% of the proteins. In order to positively
analyze those bands which are resolved, each band must be
excised from the membrane and subjected to protein sequence
analysis using Edman degradation. Unfortunately, most of
the bands were present in quantities too small to obtain a
reliable sequence, and many of those bands contained more
than one discrete protein. An additional difficulty is
that many of the proteins were blocked at the
amino-terminus, further complicating the sequencing
process.

Analyzing differentiation at the gene transcription
level has overcome many of these disadvantages and
drawbacks, since the power of recombinant DNA technology
allows amplification of signals containing very small
amounts of material. The most common method, called
"hybridization subtraction," involves isolation of mRNA
from the biological specimen before (B) and after (A) the
developmental process of interest, transcribing one set of
mRNA into cDNA, subtracting specimen B from specimen A
(mRNA from cDNA) by hybridization, and constructing a cDNA
library from the non-hybridizing mRNA fraction. Many
different groups have used this strategy successfully, and
a variety of procedures have been published and improved
upon using this same basic scheme. Nucl. Acids Res. 19,
7097-7104 (1991); Nucl. Acids Res. 18, 4833-42 (1990);
Nucl. Acids Res. 18, 2789-92 (1989); European J.
Neuroscience 2, 1063-1073 (1990); Analytical Biochem. 187,
364-73 (1990); Genet. Annals Techn. Appl. 7, 64-70 (1990);
GATA 8(4), 129-33 (1991); Proc. Natl. Acad. Sci. USA 85,
1696-1700 (1988); Nucl. Acids Res. 19, 1954 (1991); Proc.
Natl. Acad. Sci. USA 88, 1943-47 (1991); Nucl. Acids Res.
19, 6123-27 (1991); Proc. Natl. Acad. Sci. USA 85, 5738-42
(1988); Nucl. Acids Res. 16, 10937 (1988).

Although each of these techniques has particular
strengths and weaknesses, there are still some limitations
and undesirable aspects of these methods: First, the time
and effort required to construct such libraries is quite
large. Typically, a trained molecular biologist might
expect construction and characterization of such a library
to require 3 to 6 months, depending on the level of skill,
experience, and luck. Second, the resulting subtraction
libraries are typically inferior to the libraries
constructed by standard methodology. A typical
conventional cDNA library should have a clone complexity of
at least 10^6 clones, and an average insert size of 1-3 kB.
In contrast, subtracted libraries can have complexities of
10^2 or 10^3 and average insert sizes of 0.2 kB. Therefore,
there can be a significant loss of clone and sequence
information associated with such libraries. Third, this
approach allows the researcher to capture only the genes
induced in specimen A relative to specimen B, not
vice-versa, nor does it easily allow comparison to a third
specimen of interest (C). Fourth, this approach requires
very large amounts (hundreds of micrograms) of "driver"
mRNA (specimen B), which significantly limits the number
and type of subtractions that are possible since many
tissues and cells are very difficult to obtain in large
quantities.

Fifth, the resolution of the subtraction is dependent
upon the physical properties of DNA:DNA or RNA:DNA
hybridization. The ability of a given sequence to find a
hybridization match is dependent on its unique CoT value.
The CoT value is a function of the number of copies
(concentration) of the particular sequence, multiplied by
the time of hybridization. It follows that for sequences
which are abundant, hybridization events will occur very
rapidly (low CoT value), while rare sequences will form
duplexes at very high CoT values. CoT values which allow
such rare sequences to form duplexes and therefore be
effectively selected are difficult to achieve in a
convenient time frame. Therefore, hybridization
subtraction is simply not a useful technique with which to
study relative levels of rare mRNA species. Sixth, this
problem is further complicated by the fact that duplex
formation is also dependent on the nucleotide base
composition for a given sequence. Those sequences rich in
G + C form stronger duplexes than those with high contents
of A + T. Therefore, the former sequences will tend to be
removed selectively by hybridization subtraction. Seventh,
it is possible that hybridization between nonexact matches
can occur. When this happens, the expression of a
homologous gene may "mask" expression of a gene of
interest, artificially skewing the results for that
particular gene.
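
The dependence on concentration and base composition described above can be illustrated numerically. The following is a minimal sketch only, assuming that the CoT value is simply the product of sequence concentration and hybridization time as stated in the text. The function names, the concentrations and the target CoT value are hypothetical and use arbitrary units.

```python
def cot_value(concentration: float, time: float) -> float:
    """CoT as described in the text: concentration multiplied by time."""
    return concentration * time


def time_to_reach(target_cot: float, concentration: float) -> float:
    """Hybridization time needed for a sequence at `concentration` to reach `target_cot`."""
    if concentration <= 0:
        raise ValueError("concentration must be positive")
    return target_cot / concentration


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence (case-insensitive)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical abundant and rare transcripts (arbitrary units).
abundant_conc = 1000.0
rare_conc = 1.0
target = 500.0  # arbitrary CoT at which duplex formation is assumed effective

abundant_time = time_to_reach(target, abundant_conc)
rare_time = time_to_reach(target, rare_conc)
```

Under these assumed numbers the rare transcript needs a thousand times longer to reach the same CoT value, which is the practical obstacle the text identifies for rare mRNA species.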

Matsubara and Okubo proposed using partial cDNA
sequences to establish expression profiles of genes which
could be used in functional analyses of the human genome.
Matsubara and Okubo warned against using random priming, as
it creates multiple unique DNA fragments from individual
mRNAs and may thus skew the analysis of the number of
particular mRNAs per library. They sequenced randomly
selected members from a 3'-directed cDNA library and
established the frequency of appearance of the various
ESTs. They proposed comparing lists of ESTs from various
cell types to classify genes. Genes expressed in many
different cell types were labeled housekeepers and those
selectively expressed in certain cells were labeled
cell-specific genes, even in the absence of the full sequence of
the gene or the biological activity of the gene product.
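
The classification just described can be sketched as follows. This is an illustrative sketch only and is not taken from Matsubara and Okubo: the cut-offs are assumptions (an EST seen in every library is called a housekeeper, an EST seen in exactly one is called cell-specific, and anything else is left as "intermediate"), and all function, EST and cell-type names are hypothetical.

```python
from collections import Counter


def est_frequencies(sequenced_clones):
    """Count how often each EST identifier appears among sequenced clones."""
    return Counter(sequenced_clones)


def classify_genes(libraries, housekeeping_min_fraction=1.0):
    """Label each EST by the number of cell types in which it was observed.

    `libraries` maps a cell-type name to a list of EST identifiers.
    `housekeeping_min_fraction` is an assumed threshold: the fraction of
    cell types an EST must appear in to be called a housekeeper.
    """
    observed_in = {}
    for cell_type, clones in libraries.items():
        for est in set(clones):
            observed_in.setdefault(est, set()).add(cell_type)

    total = len(libraries)
    labels = {}
    for est, cell_types in observed_in.items():
        if len(cell_types) == 1:
            labels[est] = "cell-specific"
        elif len(cell_types) / total >= housekeeping_min_fraction:
            labels[est] = "housekeeper"
        else:
            labels[est] = "intermediate"
    return labels


# Hypothetical example data.
libraries = {
    "liver": ["EST1", "EST1", "EST2"],
    "brain": ["EST1", "EST3"],
    "muscle": ["EST1", "EST2"],
}
labels = classify_genes(libraries)
liver_counts = est_frequencies(libraries["liver"])
```

Note that the labels depend only on where an EST was observed, not on the full gene sequence or the activity of its product, which mirrors the point made in the text.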

The present invention avoids the drawbacks of the
prior art by providing a method to quantify the relative
abundance of multiple gene transcripts in a given
biological specimen by the use of high-throughput
sequence-specific analysis of individual RNAs and/or their
corresponding cDNAs.

The present invention offers several advantages over
current protein discovery methods which attempt to isolate
individual proteins based upon biological effects. The
method of the instant invention provides for detailed
diagnostic comparisons of cell profiles revealing numerous
changes in the expression of individual transcripts.

The instant invention provides several advantages over
current subtraction methods including a more complex
library analysis (10^6 to 10^7 clones as compared to 10^3
clones) which allows identification of low abundance
messages as well as enabling the identification of messages
which either increase or decrease in abundance. These
large libraries are very routine to make in contrast to the
libraries of previous methods. In addition, homologues can
easily be distinguished with the method of the instant
invention.

This method is very convenient because it organizes a
large quantity of data into a comprehensible, digestible
format. The most significant differences are highlighted
by electronic subtraction. In depth analyses are made more
convenient.

The present invention provides several advantages over
previous methods of electronic analysis of cDNA. The
method is particularly powerful when more than 100 and
preferably more than 1,000 gene transcripts are analyzed.
In such a case, new low-frequency transcripts are
discovered and tissue typed.

High resolution analysis of gene expression can be
used directly as a diagnostic profile or to identify
disease-specific genes for the development of more classic
diagnostic approaches.

This process is defined as gene transcript frequency 
analysis. The resulting quantitative analysis of the gene 
transcripts is defined as comparative gene transcript 
analysis. 

3. SUMMARY OF THE INVENTION

The invention is a method of analyzing a specimen
containing gene transcripts comprising the steps of (a)
producing a library of biological sequences; (b) generating
a set of transcript sequences, where each of the transcript
sequences in said set is indicative of a different one of
the biological sequences of the library; (c) processing the
transcript sequences in a programmed computer (in which a
database of reference transcript sequences indicative of
reference sequences is stored), to generate an identified
sequence value for each of the transcript sequences, where
each said identified sequence value is indicative of
sequence annotation and a degree of match between one of
the biological sequences of the library and at least one of
the reference sequences; and (d) processing each said
identified sequence value to generate final data values
indicative of the number of times each identified sequence
value is present in the library.

The invention also includes a method of comparing two
specimens containing gene transcripts. The first specimen
is processed as described above. The second specimen is
used to produce a second library of biological sequences,
which is used to generate a second set of transcript
sequences, where each of the transcript sequences in the
second set is indicative of one of the biological sequences
of the second library. Then the second set of transcript
sequences is processed in a programmed computer to generate
a second set of identified sequence values, namely the
further identified sequence values, each of which is
indicative of a sequence annotation and includes a degree
of match between one of the biological sequences of the
second library and at least one of the reference sequences.
The further identified sequence values are processed to
generate further final data values indicative of the number
of times each further identified sequence value is present
in the second library. The final data values from the
first specimen and the further identified sequence values
from the second specimen are processed to generate ratios
of transcript sequences, which indicate the differences in
the number of gene transcripts between the two specimens.

In a further embodiment, the method includes
quantifying the relative abundance of mRNA in a biological
specimen by (a) isolating a population of mRNA transcripts
from a biological specimen; (b) identifying genes from
which the mRNA was transcribed by a sequence-specific
method; (c) determining the numbers of mRNA transcripts
corresponding to each of the genes; and (d) using the mRNA
transcript numbers to determine the relative abundance of
mRNA transcripts within the population of mRNA transcripts.

Also disclosed is a method of producing a gene
transcript image analysis by first obtaining a mixture of
mRNA, from which cDNA copies are made. The cDNA is
inserted into a suitable vector which is used to transfect
suitable host strain cells which are plated out and
permitted to grow into clones, each clone representing a
unique mRNA. A representative population of clones
transfected with cDNA is isolated. Each clone in the
population is identified by a sequence-specific method
which identifies the gene from which the unique mRNA was
transcribed. The number of times each gene is identified
to a clone is determined to evaluate gene transcript
abundance. The genes and their abundances are listed in
order of abundance to produce a gene transcript image.

In a further embodiment, the relative abundance of the
gene transcripts in one cell type or tissue is compared
with the relative abundance of gene transcript numbers in a
second cell type or tissue in order to identify the
differences and similarities.

In a further embodiment, the method includes a system
for analyzing a library of biological sequences including a
means for receiving a set of transcript sequences, where
each of the transcript sequences is indicative of a
different one of the biological sequences of the library;
and a means for processing the transcript sequences in a
computer system in which a database of reference transcript
sequences indicative of reference sequences is stored,
wherein the computer is programmed with software for
generating an identified sequence value for each of the
transcript sequences, where each said identified sequence
value is indicative of a sequence annotation and the degree
of match between a different one of the biological
sequences of the library and at least one of the reference
sequences, and for processing each said identified sequence
value to generate final data values indicative of the
number of times each identified sequence value is present
in the library.

In essence, the invention is a method and system for
quantifying the relative abundance of gene transcripts in a
biological specimen. The invention provides a method for
comparing the gene transcript image from two or more
different biological specimens in order to distinguish
between the two specimens and identify one or more genes
which are differentially expressed between the two
specimens. Thus, this gene transcript image and its
comparison can be used as a diagnostic. One embodiment of
the method generates high-throughput sequence-specific
analysis of multiple RNAs or their corresponding cDNAs: a
gene transcript image. Another embodiment of the method
produces the gene transcript imaging analysis by the use of
high-throughput cDNA sequence analysis. In addition, two
or more gene transcript images can be compared and used to
detect or diagnose a particular biological state, disease,
or condition which is correlated to the relative abundance
of gene transcripts in a given cell or population of cells.

4. DESCRIPTION OF THE TABLES AND DRAWINGS 
4.1. TABLES 

Table 1 presents a detailed explanation of the letter
codes utilized in Tables 2-5.

Table 2 lists the one hundred most common gene
transcripts. It is a partial list of isolates from the
HUVEC cDNA library prepared and sequenced as described
below. The left-hand column refers to the sequence's order
of abundance in this table. The next column labeled
"number" is the clone number of the first HUVEC sequence
identification reference matching the sequence in the
"entry" column number. Isolates that have not been
sequenced are not present in Table 2. The next column,
labeled "N", indicates the total number of cDNAs which have
the same degree of match with the sequence of the reference
transcript in the "entry" column.

The column labeled "entry" gives the NIH GENBANK locus
name, which corresponds to the library sequence numbers.
The "s" column indicates in a few cases the species of the
reference sequence. The code for column "s" is given in
Table 1. The column labeled "descriptor" provides a plain
English explanation of the identity of the sequence
corresponding to the NIH GENBANK locus name in the "entry"
column.

Table 3 is a comparison of the top fifteen most 
abundant gene transcripts in normal monocytes and activated 
macrophage cells. 

Table 4 is a detailed summary of the library subtraction
analysis comparing the THP-1 and human macrophage
cDNA sequences. In Table 4, the same code as in Table 2 is
used. Additional columns are for "bgfreq" (abundance
number in the subtractant library), "rfend" (abundance
number in the target library) and "ratio" (the target
abundance number divided by the subtractant abundance
number). As is clear from perusal of the table, when the
abundance number in the subtractant library is "0", the
target abundance number is divided by 0.05. This is a way
of obtaining a result (division by 0 not being possible) and
of distinguishing the result from ratios obtained with
subtractant numbers of 1.
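
The ratio rule described for Table 4 can be illustrated with a short sketch. This is not the program of Table 5; it is a minimal Python illustration that assumes each library has already been reduced to a mapping from a GENBANK-style entry name to an abundance number. The function names and the sample entries are hypothetical.

```python
# Illustrative sketch only; not the source code referred to as Table 5.
# Assumption: each library is a dict mapping an entry name to its abundance number.

ZERO_SUBSTITUTE = 0.05  # divisor the text says is used when the subtractant count is 0


def subtraction_ratio(target_count, subtractant_count):
    """Target abundance divided by subtractant abundance, with 0 replaced by 0.05."""
    divisor = subtractant_count if subtractant_count > 0 else ZERO_SUBSTITUTE
    return target_count / divisor


def subtraction_profile(target, subtractant):
    """Return rows of (entry, bgfreq, rfend, ratio), largest ratio first."""
    rows = []
    for entry, rfend in target.items():
        bgfreq = subtractant.get(entry, 0)  # entries absent from the subtractant count as 0
        rows.append((entry, bgfreq, rfend, subtraction_ratio(rfend, bgfreq)))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


if __name__ == "__main__":
    # Hypothetical abundance numbers, for illustration only.
    target_library = {"ENTRY_A": 3, "ENTRY_B": 3, "ENTRY_C": 8}
    subtractant_library = {"ENTRY_B": 1, "ENTRY_C": 4}
    for row in subtraction_profile(target_library, subtractant_library):
        print(row)
```

With a divisor of 0.05, a transcript seen three times in the target library and never in the subtractant library receives a ratio near 60, whereas the same target count against a subtractant count of 1 gives 3, so the two cases remain distinguishable after sorting.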

Table 5 is the computer program, written in source
code, for generating gene transcript subtraction profiles.

Table 6 is a partial listing of database entries used 
in the electronic northern blot analysis as provided by the 
present invention. 

4.2. BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a chart summarizing data collected and
stored regarding the library construction portion of
sequence preparation and analysis.

Figure 2 is a diagram representing the sequence of
operations performed by "abundance sort" software in a
class of preferred embodiments of the inventive method.

Figure 3 is a block diagram of a preferred embodiment
of the system of the invention.

Figure 4 is a more detailed block diagram of the
bioinformatics process from new sequence (that has already
been sequenced but not identified) to printout of the
transcript imaging analysis and the provision of database
subscriptions.

5. DETAILED DESCRIPTION OF THE INVENTION

The present invention provides a method to compare the
relative abundance of gene transcripts in different
biological specimens by the use of high-throughput
sequence-specific analysis of individual RNAs or their
corresponding cDNAs (or alternatively, of data representing
other biological sequences). This process is denoted
herein as gene transcript imaging. The quantitative
analysis of the relative abundance for a set of gene
transcripts is denoted herein as "gene transcript image
analysis" or "gene transcript frequency analysis". The
present invention allows one to obtain a profile for gene
transcription in any given population of cells or tissue
from any type of organism. The invention can be applied to
obtain a profile of a specimen consisting of a single cell
(or clones of a single cell), or of many cells, or of
tissue more complex than a single cell and containing
multiple cell types, such as liver.

The invention has significant advantages in the fields
of diagnostics, toxicology and pharmacology, to name a few.
A highly sophisticated diagnostic test can be performed on
the ill patient in whom a diagnosis has not been made. A
biological specimen consisting of the patient's fluids or
tissues is obtained, and the gene transcripts are isolated
and expanded to the extent necessary to determine their
identity. Optionally, the gene transcripts can be
converted to cDNA. A sampling of the gene transcripts is
subjected to sequence-specific analysis and quantified.
These gene transcript sequence abundances are compared
against reference database sequence abundances including
normal data sets for diseased and healthy patients. The
patient has the disease(s) with which the patient's data
set most closely correlates.
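
The comparison of a patient's data set against reference data sets can be sketched as follows. The text does not specify how correlation is measured; this minimal Python sketch assumes a Pearson correlation over relative abundances, and the function names, gene names and reference labels are all hypothetical.

```python
import math

# Illustrative sketch only. Assumption: "most closely correlates" is modelled here
# as the highest Pearson correlation; the source does not name a specific measure.


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def closest_reference(patient, references):
    """Return (label, score) of the reference profile best correlated with the patient."""
    best_label, best_score = None, -2.0
    for label, profile in references.items():
        genes = sorted(set(patient) | set(profile))  # genes missing from a profile count as 0
        score = pearson([patient.get(g, 0.0) for g in genes],
                        [profile.get(g, 0.0) for g in genes])
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


if __name__ == "__main__":
    # Hypothetical relative abundances, for illustration only.
    patient = {"gene_a": 0.5, "gene_b": 0.3, "gene_c": 0.2}
    references = {
        "healthy": {"gene_a": 0.2, "gene_b": 0.3, "gene_c": 0.5},
        "disease_x": {"gene_a": 0.6, "gene_b": 0.3, "gene_c": 0.1},
    }
    print(closest_reference(patient, references))
```

Other similarity measures (for example rank correlation) could be substituted without changing the overall structure of the comparison.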

For example, gene transcript frequency analysis can be
used to differentiate normal cells or tissues from diseased
cells or tissues, just as it highlights differences between
normal monocytes and activated macrophages in Table 3.

In toxicology, a fundamental question is which tests
are most effective in predicting or detecting a toxic
effect. Gene transcript imaging provides highly detailed
information on the cell and tissue environment, some of
which would not be obvious in conventional, less detailed
screening methods. The gene transcript image is a more
powerful method to predict drug toxicity and efficacy.
Similar benefits accrue in the use of this tool in
pharmacology. The gene transcript image can be used
selectively to look at protein categories which are
expected to be affected, for example, enzymes which
detoxify toxins.

In an alternative embodiment, comparative gene
transcript frequency analysis is used to differentiate
between cancer cells which respond to anti-cancer agents
and those which do not respond. Examples of anti-cancer
agents are tamoxifen, vincristine, vinblastine,
podophyllotoxins, etoposide, teniposide, cisplatin,
biologic response modifiers such as interferon, IL-2,
GM-CSF, enzymes, hormones and the like. This method also
provides a means for sorting the gene transcripts by
functional category. In the case of cancer cells,
transcription factors or other essential regulatory
molecules are very important categories to analyze across
different libraries.

In yet another embodiment, comparative gene transcript
frequency analysis is used to differentiate between control
liver cells and liver cells isolated from patients treated
with experimental drugs like FIAU to distinguish between
pathology caused by the underlying disease and that caused
by the drug.

In yet another embodiment, comparative gene transcript
frequency analysis is used to differentiate between brain
tissue from patients treated and untreated with lithium.

In a further embodiment, comparative gene transcript
frequency analysis is used to differentiate between
cyclosporin and FK506-treated cells and normal cells.

In a further embodiment, comparative gene transcript
frequency analysis is used to differentiate between virally
infected (including HIV-infected) human cells and
uninfected human cells. Gene transcript frequency analysis
is also used to rapidly survey gene transcripts in
HIV-resistant, HIV-infected, and HIV-sensitive cells.
Comparison of gene transcript abundance will indicate the
success of treatment and/or new avenues to study.

In a further embodiment, comparative gene transcript
frequency analysis is used to differentiate between
bronchial lavage fluids from healthy and unhealthy patients
with a variety of ailments.

In a further embodiment, comparative gene transcript
frequency analysis is used to differentiate between cell,
plant, microbial and animal mutants and wild-type species.
In addition, the transcript abundance program is adapted to
permit the scientist to evaluate the transcription of one
gene in many different tissues. Such comparisons could
identify deletion mutants which do not produce a gene
product and point mutants which produce a less abundant or
otherwise different message. Such mutations can affect
basic biochemical and pharmacological processes, such as
mineral nutrition and metabolism, and can be isolated by
means known to those skilled in the art. Thus, crops with
improved yields, pest resistance and other factors can be
developed.

In a further embodiment, comparative gene transcript
frequency analysis is used for an interspecies comparative
analysis which would allow for the selection of better
pharmacologic animal models. In this embodiment, humans
and other animals (such as a mouse), or their cultured
cells are treated with a specific test agent. The relative
sequence abundance of each cDNA population is determined.
If the animal test system is a good model, homologous genes
in the animal cDNA population should change expression
similarly to those in human cells. If side effects are
detected with the drug, a detailed transcript abundance
analysis will be performed to survey gene transcript
changes. Models will then be evaluated by comparing basic
physiological changes.

In a further embodiment, comparative gene transcript
frequency analysis is used in a clinical setting to give a
highly detailed gene transcript profile of a patient's
cells or tissue (for example, a blood sample). In
particular, gene transcript frequency analysis is used to
give a high resolution gene expression profile of a
diseased state or condition.

In the preferred embodiment, the method utilizes
high-throughput cDNA sequencing to identify specific
transcripts of interest. The generated cDNA and deduced
amino acid sequences are then extensively compared with
GENBANK and other sequence data banks as described below.
The method offers several advantages over current protein
discovery by two-dimensional gel methods which try to
identify individual proteins involved in a particular
biological effect. Here, detailed comparisons of profiles
of activated and inactive cells reveal numerous changes in
the expression of individual transcripts. After it is
determined whether the sequence is an "exact" match, a similar
match or a non-match, the sequence is entered into a database.
Next, the numbers of copies of cDNA corresponding to each
gene are tabulated. Although this can be done slowly and
arduously, if at all, by human hand from a printout of all
entries, a computer program is a useful and rapid way to
tabulate this information. The numbers of cDNA copies
(optionally divided by the total number of sequences in the
data set) provide a picture of the relative abundance of
transcripts for each corresponding gene. The list of
represented genes can then be sorted by abundance in the
cDNA population. A multitude of additional types of
comparisons or dimensions are possible and are exemplified
below.
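
The tabulation described above (count the cDNA copies per gene, optionally divide by the total, then sort by abundance) can be sketched in a few lines of Python. This is an illustration rather than the program of Table 5; it assumes each sequenced clone has already been assigned a gene identifier, and the function name and sample identifiers are hypothetical.

```python
from collections import Counter

# Illustrative sketch only. Assumption: every sequenced clone has already been
# matched to a gene identifier (for example a GENBANK locus name).


def transcript_image(identified_genes):
    """Return (gene, copies, relative_abundance) rows sorted by descending copies."""
    counts = Counter(identified_genes)
    total = sum(counts.values())
    # most_common() orders by count, highest first; ties keep first-seen order.
    return [(gene, copies, copies / total) for gene, copies in counts.most_common()]


if __name__ == "__main__":
    # Hypothetical clone identifications, for illustration only.
    clones = ["gene_a", "gene_a", "gene_b", "gene_a", "gene_b", "gene_c"]
    for gene, copies, fraction in transcript_image(clones):
        print(gene, copies, fraction)
```

Dividing by the total number of sequences makes images from libraries of different sizes directly comparable, which is why the normalization is useful even though the text marks it as optional.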

An alternate method of producing a gene transcript
image includes the steps of obtaining a mixture of test
mRNA and providing a representative array of unique probes
whose sequences are complementary to at least some of the
test mRNAs. Next, a fixed amount of the test mRNA is added
to the arrayed probes. The test mRNA is incubated with the
probes for a sufficient time to allow hybrids of the test
mRNA and probes to form. The mRNA-probe hybrids are
detected and the quantity determined. The hybrids are
identified by their location in the probe array. The
quantity of each hybrid is summed to give a population
number. Each hybrid quantity is divided by the population
number to provide a set of relative abundance data termed a
gene transcript image analysis.
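
The arithmetic of the array-based method (sum the hybrid quantities to a population number, then divide each quantity by that number) can be sketched as below. The data layout is an assumption: hybrid quantities are keyed by array position, and a second mapping gives the probe at each position. All names and values are hypothetical.

```python
# Illustrative sketch only. Assumption: detected hybrid quantities are keyed by
# (row, column) array position, and probe identity is looked up from that position.


def image_from_hybridization(quantity_by_position, probe_at_position):
    """Convert per-position hybrid quantities into relative abundance per probe."""
    population_number = sum(quantity_by_position.values())
    if population_number == 0:
        raise ValueError("no hybrids detected; relative abundance is undefined")
    return {
        probe_at_position[position]: quantity / population_number
        for position, quantity in quantity_by_position.items()
    }


if __name__ == "__main__":
    # Hypothetical array layout and signal quantities, for illustration only.
    layout = {(0, 0): "probe_a", (0, 1): "probe_b"}
    quantities = {(0, 0): 30.0, (0, 1): 10.0}
    print(image_from_hybridization(quantities, layout))
```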

6. EXAMPLES

The examples below are provided to illustrate the 
subject invention. These examples are provided by way of 
illustration and are not included for the purpose of 
limiting the invention. 

6.1. TISSUE SOURCES AND CELL LINES

For analysis with the computer program claimed herein,
biological sequences can be obtained from virtually any
source. Most popular are tissues obtained from the human
body. Tissues can be obtained from any organ of the body,
any age donor, any abnormality or any immortalized cell
line. Immortal cell lines may be preferred in some
instances because of their purity of cell type; other
tissue samples invariably include mixed cell types. A
special technique is available to take a single cell (for
example, a brain cell) and harness the cellular machinery
to grow up sufficient cDNA for sequencing by the techniques
and analysis described herein (cf. U.S. Patent Nos.
5,021,335 and 5,168,038, which are incorporated by
reference). The examples given herein utilized the
following immortalized cell lines: monocyte-like U-937
cells, activated macrophage-like THP-1 cells, induced
vascular endothelial cells (HUVEC cells) and mast cell-like
HMC-1 cells.

The U-937 cell line is a human histiocytic lymphoma
cell line with monocyte characteristics, established from
malignant cells obtained from the pleural effusion of a
patient with diffuse histiocytic lymphoma (Sundstrom, C.
and Nilsson, K. (1976) Int. J. Cancer 17:565). U-937 is
one of only a few human cell lines with the morphology,
cytochemistry, surface receptors and monocyte-like
characteristics of histiocytic cells. These cells can be
induced to terminal monocytic differentiation and will
express new cell surface molecules when activated with
supernatants from human mixed lymphocyte cultures. Upon
this type of in vitro activation, the cells undergo
morphological and functional changes, including
augmentation of antibody-dependent cellular cytotoxicity
(ADCC) against erythroid and tumor target cells (one of the
principal functions of macrophages). Activation of U-937
cells with phorbol 12-myristate 13-acetate (PMA) in vitro
stimulates the production of several compounds, including
prostaglandins, leukotrienes and platelet-activating factor
(PAF), which are potent inflammatory mediators. Thus,
U-937 is a cell line that is well suited for the
identification and isolation of gene transcripts associated
with normal monocytes.

The HUVEC cell line is a normal, homogeneous, well
characterized, early passage endothelial cell culture from
human umbilical vein (Cell Systems Corp., 12815 NE 124th
Street, Kirkland, WA 98034). Only gene transcripts from
induced, or treated, HUVEC cells were sequenced. One batch
of 1 x 10^8 cells was treated for 5 hours with 1 U/ml rIL-1b
and 100 ng/ml E. coli lipopolysaccharide (LPS) endotoxin
prior to harvesting. A separate batch of 2 x 10^8 cells was
treated at confluence with 4 U/ml TNF and 2 U/ml
interferon-gamma (IFN-gamma) prior to harvesting.

THP-1 is a human leukemic cell line with distinct
monocytic characteristics. This cell line was derived from
the blood of a 1-year-old boy with acute monocytic leukemia
(Tsuchiya, S. et al. (1980) Int. J. Cancer: 171-76). The
following cytological and cytochemical criteria were used
to determine the monocytic nature of the cell line: 1) the
presence of alpha-naphthyl butyrate esterase activity which
could be inhibited by sodium fluoride; 2) the production of
lysozyme; 3) the phagocytosis of latex particles and
sensitized SRBC (sheep red blood cells); and 4) the ability
of mitomycin C-treated THP-1 cells to activate
T-lymphocytes following ConA (concanavalin A) treatment.
Morphologically, the cytoplasm contained small azurophilic
granules and the nucleus was indented and irregularly
shaped with deep folds. The cell line had Fc and C3b
receptors, probably functioning in phagocytosis. THP-1
cells treated with the tumor promoter
12-O-tetradecanoyl-phorbol-13-acetate (TPA) stop proliferating and
differentiate into macrophage-like cells which mimic native
monocyte-derived macrophages in several respects.
Morphologically, as the cells change shape, the nucleus
becomes more irregular and additional phagocytic vacuoles
appear in the cytoplasm. The differentiated THP-1 cells
also exhibit an increased adherence to tissue culture
plastic.

HMC-1 cells (a human mast cell line) were established
from the peripheral blood of a Mayo Clinic patient with
mast cell leukemia (Leukemia Res. (1988) 12:345-55). The
cultured cells looked similar to immature cloned murine
mast cells, contained histamine, and stained positively for
chloroacetate esterase, amino caproate esterase, eosinophil
major basic protein (MBP) and tryptase. The HMC-1 cells
have, however, lost the ability to synthesize normal IgE
receptors. HMC-1 cells also possess a 10;16 translocation,
present in cells initially collected by leukapheresis from
the patient and not an artifact of culturing. Thus, HMC-1
cells are a good model for mast cells.

6.2. CONSTRUCTION OF cDNA LIBRARIES 

For inter-library comparisons, the libraries must be
prepared in similar manners. Certain parameters appear to
be particularly important to control. One such parameter
is the method of isolating mRNA. It is important to use
the same conditions to remove DNA and heterogeneous nuclear
RNA from comparison libraries. Size fractionation of cDNA
must be carefully controlled. The same vector preferably
should be used for preparing libraries to be compared. At
the very least, the same type of vector (e.g.,
unidirectional vector) should be used to assure a valid
comparison. A unidirectional vector may be preferred in
order to more easily analyze the output.

It is preferred to prime only with oligo dT
unidirectional primer in order to obtain only one clone per
mRNA transcript when obtaining cDNAs. However, it is
recognized that employing a mixture of oligo dT and random
primers can also be advantageous because such a mixture
results in more sequence diversity when gene discovery also
is a goal. Similar effects can be obtained with DR2
(Clontech) and HXLOX (US Biochemical) and also vectors from
Invitrogen and Novagen. These vectors have two
requirements. First, there must be primer sites for
commercially available primers such as T3 or M13 reverse
primers. Second, the vector must accept inserts up to 10
kb.

It also is important that the clones be randomly
sampled, and that a significant population of clones is
used. Data have been generated with 5,000 clones; however,
if very rare genes are to be obtained and/or their relative
abundance determined, as many as 100,000 clones from a
single library may need to be sampled. Size fractionation
of cDNA also must be carefully controlled. Alternately,
plaques can be selected, rather than clones.
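
The relationship between sampling depth and the chance of encountering a rare transcript can be made concrete with a small worked calculation. This is not taken from the specification; it assumes clones are drawn independently at random and that a transcript makes up a fixed fraction f of the library, so that the probability of seeing it at least once in N clones is 1 - (1 - f)^N. The function names and the example frequency are hypothetical.

```python
import math

# Illustrative sketch only. Assumption: clones are sampled independently and at
# random, and a given transcript makes up a fixed fraction of the library.


def prob_at_least_one(fraction, clones_sampled):
    """Probability that a transcript at the given fraction appears at least once."""
    # 1 - (1 - f)^N, computed with log1p/expm1 for accuracy when f is tiny.
    return -math.expm1(clones_sampled * math.log1p(-fraction))


def clones_needed(fraction, confidence):
    """Smallest number of clones giving at least the requested detection probability."""
    return math.ceil(math.log1p(-confidence) / math.log1p(-fraction))


if __name__ == "__main__":
    rare = 1e-5  # hypothetical transcript present at 1 in 100,000 clones
    print(prob_at_least_one(rare, 5_000))
    print(prob_at_least_one(rare, 100_000))
    print(clones_needed(rare, 0.95))
```

Under these assumptions, a transcript present at one clone in 100,000 is unlikely to be seen in a 5,000-clone sample but is more likely than not to appear once 100,000 clones are sampled, which is consistent with the sampling depths mentioned above.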

Besides the Uni-ZAP™ vector system by Stratagene
disclosed below, it is now believed that other similarly
unidirectional vectors also can be used. For example, it
is believed that such vectors include but are not limited
to DR2 (Clontech), and HXLOX (U.S. Biochemical).

Preferably, the details of library construction (as
shown in Figure 1) are collected and stored in a database
for later retrieval relative to the sequences being
compared. Fig. 1 shows important information regarding the
library collaborator or cell or cDNA supplier,
pretreatment, biological source, culture, mRNA preparation
and cDNA construction. Similarly detailed information
about the other steps is beneficial in analyzing sequences
and libraries in depth.

RNA must be harvested from cells and tissue samples
and cDNA libraries are subsequently constructed. cDNA
libraries can be constructed according to techniques known
in the art. (See, for example, Maniatis, T. et al. (1982)
Molecular Cloning, Cold Spring Harbor Laboratory, New
York). cDNA libraries may also be purchased. The U-937
cDNA library (catalog No. 937207) was obtained from
Stratagene, Inc., 11099 N. Torrey Pines Rd., La Jolla, CA
92037.

The THP-1 cDNA library was custom constructed by
Stratagene from THP-1 cells cultured 48 hours with 100 nM
TPA and 4 hours with 1 µg/ml LPS. The human mast cell
HMC-1 cDNA library was also custom constructed by Stratagene
from cultured HMC-1 cells. The HUVEC cDNA library was
custom constructed by Stratagene from two batches of
induced HUVEC cells which were separately processed.

Essentially, all the libraries were prepared in the
same manner. First, poly(A+) RNA (mRNA) was purified. For
the U-937 and HMC-1 RNA, cDNA synthesis was only primed
with oligo dT. For the THP-1 and HUVEC RNA, cDNA synthesis
was primed separately with both oligo dT and random
hexamers, and the two cDNA libraries were treated
separately. Synthetic adaptor oligonucleotides were
ligated onto cDNA ends enabling its insertion into the
Uni-ZAP™ vector system (Stratagene), allowing high efficiency
unidirectional (sense orientation) lambda library
construction and the convenience of a plasmid system with
blue-white color selection to detect clones with cDNA
insertions. Finally, the two libraries were combined into
a single library by mixing equal numbers of bacteriophage.

The libraries can be screened with either DNA probes
or antibody probes and the pBluescript® phagemid
(Stratagene) can be rapidly excised in vivo. The phagemid
allows the use of a plasmid system for easy insert
characterization, sequencing, site-directed mutagenesis,
the creation of unidirectional deletions and expression of
fusion proteins. The custom-constructed library phage
particles were infected into E. coli host strain XL1-Blue®
(Stratagene), which has a high transformation efficiency,
increasing the probability of obtaining rare,
under-represented clones in the cDNA library.

6.3. ISOLATION OF cDNA CLONES

The phagemid forms of individual cDNA clones were
obtained by the in vivo excision process, in which the host
bacterial strain was coinfected with both the lambda
library phage and an f1 helper phage. Proteins derived
from both the library-containing phage and the helper phage
nicked the lambda DNA, initiated new DNA synthesis from
defined sequences on the lambda target DNA and created a
smaller, single stranded circular phagemid DNA molecule
that included all DNA sequences of the pBluescript® plasmid
and the cDNA insert. The phagemid DNA was secreted from
the cells and purified, then used to re-infect fresh host
cells, where the double stranded phagemid DNA was produced.
Because the phagemid carries the gene for beta-lactamase,
the newly-transformed bacteria are selected on medium
containing ampicillin.

Phagemid DNA was purified using the Magic Minipreps™
DNA Purification System (Promega catalogue #A7100, Promega
Corp., 2800 Woods Hollow Rd., Madison, WI 53711). This
small-scale process provides a simple and reliable method
for lysing the bacterial cells and rapidly isolating
purified phagemid DNA using a proprietary DNA-binding
resin. The DNA was eluted from the purification resin
already prepared for DNA sequencing and other analytical
manipulations.

Phagemid DNA was also purified using the QIAwell-8
Plasmid Purification System from QIAGEN® DNA Purification
System (QIAGEN Inc., 9259 Eton Ave., Chatsworth, CA
91311). This product line provides a convenient, rapid and
reliable high-throughput method for lysing the bacterial
cells and isolating highly purified phagemid DNA using
QIAGEN anion-exchange resin particles with EMPORE™ membrane
technology from 3M in a multiwell format. The DNA was
eluted from the purification resin already prepared for DNA
sequencing and other analytical manipulations.

An alternate method of purifying phagemid has recently
become available. It utilizes the Miniprep Kit (Catalog
No. 77468, available from Advanced Genetic Technologies
Corp., 19212 Orbit Drive, Gaithersburg, Maryland). This
kit is in the 96-well format and provides enough reagents
for 960 purifications. Each kit is provided with a
recommended protocol, which has been employed except for
the following changes. First, the 96 wells are each filled
with only 1 ml of sterile terrific broth with carbenicillin
at 25 mg/L and glycerol at 0.4%. After the wells are
inoculated, the bacteria are cultured for 24 hours and
lysed with 60 µl of lysis buffer. A centrifugation step
(2900 rpm for 5 minutes) is performed before the contents
of the block are added to the primary filter plate. The
optional step of adding isopropanol to TRIS buffer is not
routinely performed. After the last step in the protocol,
samples are transferred to a Beckman 96-well block for
storage.

Another new DNA purification system is the WIZARD™ 
product line which is available from Promega (catalog No. 
A7071) and may be adaptable to the 96-well format. 

6.4. SEQUENCING OF cDNA CLONES
The cDNA inserts from random isolates of the U-937 and
THP-1 libraries were sequenced in part. Methods for DNA
sequencing are well known in the art. Conventional
enzymatic methods employ DNA polymerase Klenow fragment,
Sequenase™ or Taq polymerase to extend DNA chains from an
oligonucleotide primer annealed to the DNA template of
interest. Methods have been developed for the use of both
single- and double-stranded templates. The chain
termination reaction products are usually electrophoresed
on urea-acrylamide gels and are detected either by
autoradiography (for radionuclide-labeled precursors) or by
fluorescence (for fluorescent-labeled precursors). Recent
improvements in mechanized reaction preparation, sequencing
and analysis using the fluorescent detection method have
permitted expansion in the number of sequences that can be
determined per day (for example, with the Applied Biosystems
373 and 377 DNA sequencers and the Catalyst 800). Currently,
with the system as described, read lengths range from 250 to
400 bases and are clone dependent. Read length also varies
with the length of time the gel is run. In general,
shorter runs tend to truncate the sequence. A minimum of
only about 25 to 50 bases is necessary to establish the
identification and degree of homology of the sequence.

Gene transcript imaging can be used with any sequence-
specific method, including, but not limited to,
hybridization, mass spectroscopy, capillary electrophoresis
and 505 gel electrophoresis.



Using the nucleotide sequences derived from the cDNA
clones as query sequences (sequences of a Sequence
Listing), databases containing previously identified
sequences are searched for areas of homology (similarity).
Examples of such databases include Genbank and EMBL. We
next describe examples of two homology search algorithms
that can be used, and then describe the subsequent
computer-implemented steps to be performed in accordance
with preferred embodiments of the invention.



6.5. HOMOLOGY SEARCHING OF cDNA CLONE AND
DEDUCED PROTEIN (and Subsequent Steps)
In the following description of the computer-
implemented steps of the invention, the word "library"
denotes a set (or population) of biological specimen
nucleic acid sequences. A "library" can consist of cDNA
sequences, RNA sequences, or the like, which characterize a
biological specimen. The biological specimen can consist
of cells of a single human cell type (or can be any of the
other above-mentioned types of specimens). We contemplate
that the sequences in a library have been determined so as
to accurately represent or characterize a biological
specimen (for example, they can consist of representative
cDNA sequences from clones of RNA taken from a single human
cell).

In the following description of the computer-
implemented steps of the invention, the expression
"database" denotes a set of stored data which represent a
collection of sequences, which in turn represent a
collection of biological reference materials. For example,
a database can consist of data representing many stored
cDNA sequences which are in turn representative of human
cells infected with various viruses, cells of humans of
various ages, cells from different mammalian species, and
so on.

In preferred embodiments, the invention employs a
computer programmed with software (to be described) for
performing the following steps:

(a) processing data indicative of a library of cDNA
sequences (generated as a result of high-throughput cDNA
sequencing or other method) to determine whether each
sequence in the library matches a DNA sequence of a
reference database of DNA sequences (and if so, identifying
the reference database entry which matches the sequence and
indicating the degree of match between the reference
sequence and the library sequence) and assigning an
identified sequence value based on the sequence annotation
and degree of match to each of the sequences in the
library;

(b) for some or all entries of the database,
tabulating the number of matching identified sequence
values in the library (although this can be done by human
hand from a printout of all entries, we prefer to perform
this step using computer software to be described below),
thereby generating a set of final data values or "abundance
numbers"; and

(c) if the libraries are different sizes, dividing
each abundance number by the total number of sequences in
the library, to obtain a relative abundance number for each
identified sequence value (i.e., a relative abundance of
each gene transcript).
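
Steps (b) and (c) amount to counting how many library sequences were assigned to each database entry and then dividing each count by the library size. The following Python fragment is a minimal sketch of that arithmetic only. The function name, the input format and the entry labels are hypothetical and are not taken from the Table 5 listing.

```python
from collections import Counter

def abundance_table(identified, library_size=None):
    """Tabulate abundance numbers and relative abundances.

    identified   -- one database-entry label per matched library sequence
    library_size -- total sequences in the library (defaults to len(identified))
    """
    counts = Counter(identified)                       # step (b): abundance numbers
    total = library_size if library_size is not None else len(identified)
    rows = [(entry, n, n / total) for entry, n in counts.items()]  # step (c)
    # Sort by decreasing abundance; ties are broken by label for a stable listing.
    rows.sort(key=lambda r: (-r[1], r[0]))
    return rows

# Hypothetical example data: the labels stand in for identified sequence values.
library = ["actin", "ferritin", "actin", "EF-1 alpha", "actin", "ferritin"]
for entry, n, rel in abundance_table(library):
    print(f"{entry:12s} {n:3d} {rel:.3f}")
```

Passing `library_size` explicitly allows unmatched or discarded sequences to count toward the denominator, which is one possible reading of "the total number of sequences in the library."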

The list of identified sequence values (or genes
corresponding thereto) can then be sorted by abundance in
the cDNA population. A multitude of additional types of
comparisons or dimensions are possible.

For example (to be described below in greater detail),
steps (a) and (b) can be repeated for two different
libraries (sometimes referred to as a "target" library and
a "subtractant" library). Then, for each identified
sequence value (or gene transcript), a "ratio" value is
obtained by dividing the abundance number (for that
identified sequence value) for the target library, by the
abundance number (for that identified sequence value) for
the subtractant library.

In fact, subtraction may be carried out on multiple
libraries. It is possible to add the transcripts from
several libraries (for example, three) and then to divide
them by another set of transcripts from multiple libraries
(again, for example, three). Notation for this operation
may be abbreviated as (A+B+C)/(D+E+F), where the capital
letters each indicate an entire library. Optionally the
abundance numbers of transcripts in the summed libraries
may be divided by the total sample size before subtraction.
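
The pooled operation (A+B+C)/(D+E+F) can be sketched as follows in Python. This is an illustration under stated assumptions, not the Table 5 program: each library is assumed to be a mapping from an entry label to its abundance number, the function name is hypothetical, and entries missing from either pool are simply omitted.

```python
from collections import Counter

def pooled_ratio(numerator_libs, denominator_libs, normalize=True):
    """Ratio of summed abundances, (A+B+C)/(D+E+F), per identified sequence value.

    Each library is a Counter mapping an entry label to its abundance number.
    Only entries present in both pools are returned.
    """
    num, den = Counter(), Counter()
    for lib in numerator_libs:
        num.update(lib)            # add abundance numbers across libraries
    for lib in denominator_libs:
        den.update(lib)
    # Optional division by total sample size before taking the ratio.
    num_total = sum(num.values()) if normalize else 1
    den_total = sum(den.values()) if normalize else 1
    return {
        entry: (num[entry] / num_total) / (den[entry] / den_total)
        for entry in num if entry in den
    }

# Hypothetical abundance numbers for two pooled libraries against one.
A = Counter({"g1": 4, "g2": 1})
B = Counter({"g1": 2})
D = Counter({"g1": 3, "g2": 3})
print(pooled_ratio([A, B], [D], normalize=False))
```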

Unlike standard hybridization technology, which permits
a single subtraction of two libraries, once one has
processed a set or library of transcript sequences and stored
them in the computer, any number of subtractions can be
performed on the library. For example, by this method,
ratio values can be obtained by dividing relative abundance
values in a first library by corresponding values in a
second library and vice versa.

In variations on step (a), the library consists of
nucleotide sequences derived from cDNA clones. Examples of
databases which can be searched for areas of homology
(similarity) in step (a) include the commercially available
databases known as Genbank (NIH), EMBL (European Molecular
Biology Labs, Germany), and GENESEQ (Intelligenetics,
Mountain View, California).

One homology search algorithm which can be used to
implement step (a) is the algorithm described in the paper
by D.J. Lipman and W.R. Pearson, entitled "Rapid and
Sensitive Protein Similarity Searches," Science 227:1435
(1985). In this algorithm, the homologous regions are
searched in a two-step manner. In the first step, the
highest homologous regions are determined by calculating a
matching score using a homology score table. The parameter
"Ktup" is used in this step to establish the minimum window
size to be shifted for comparing two sequences. Ktup also
sets the number of bases that must match to extract the
highest homologous region among the sequences. In this
step, no insertions or deletions are applied and the
homology is displayed as an initial (INIT) value.

In the second step, the homologous regions are aligned
to obtain the highest matching score by inserting a gap in
order to add a probable deleted portion. The matching
score obtained in the first step is recalculated using the
homology score Table and the insertion score Table to an
optimized (OPT) value in the final output.
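
The role of Ktup in the first, ungapped step can be illustrated with a small Python sketch. This is a deliberately simplified stand-in and not the published algorithm: it counts shared k-tuples per diagonal instead of scoring with a homology score table, and the function name and default value of `ktup` are assumptions made for illustration.

```python
from collections import defaultdict

def best_ungapped_diagonal(query, subject, ktup=4):
    """Return (diagonal, hits) for the diagonal sharing the most k-tuples.

    A diagonal is the offset subject_pos - query_pos. No insertions or
    deletions are considered, mirroring the first (ungapped) step only.
    Returns None when no k-tuple is shared.
    """
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)      # positions of each k-tuple in the query
    hits = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            hits[j - i] += 1                    # same offset means same diagonal
    if not hits:
        return None
    # Prefer the most hits; among ties, the diagonal closest to zero offset.
    diagonal = max(hits, key=lambda d: (hits[d], -abs(d)))
    return diagonal, hits[diagonal]

print(best_ungapped_diagonal("GATTACA", "CCGATTACACC", ktup=3))
```

A larger `ktup` makes the search faster and more selective, since more consecutive bases must match before a diagonal receives any credit.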

DNA homologies between two sequences can be examined
graphically using the Harr method of constructing dot
matrix homology plots (Needleman, S.B. and Wunsch, C.D., J.
Mol. Biol. 48:443 (1970)). This method produces a
two-dimensional plot which can be useful in determining
regions of homology versus regions of repetition.
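
A dot matrix plot of this kind can be sketched in a few lines of Python. The text-mode output, the function name and the `window` parameter are assumptions made for illustration; a region of homology appears as one long diagonal of marks, whereas a repeated region appears as several parallel diagonals.

```python
def dot_matrix(seq_a, seq_b, window=3):
    """Return rows of '*' / '.' marking where windows of seq_a and seq_b are identical.

    Row i, column j is '*' when seq_a[i:i+window] == seq_b[j:j+window].
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            row += "*" if seq_a[i:i + window] == seq_b[j:j + window] else "."
        rows.append(row)
    return rows

# Hypothetical short sequences; the second contains one inserted base.
for line in dot_matrix("ACGTACGT", "ACGTTACGT"):
    print(line)
```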

However, in a class of preferred embodiments, step (a)
is implemented by processing the library data in the
commercially available computer program known as the
INHERIT 670 Sequence Analysis System, available from
Applied Biosystems Inc. (Foster City, California),
including the software known as the Factura software (also
available from Applied Biosystems Inc.). The Factura
program preprocesses each library sequence to "edit out"
portions thereof which are not likely to be of interest,
such as the vector used to prepare the library. Additional
sequences which can be edited out or masked (ignored by the
search tools) include but are not limited to the polyA tail
and repetitive GAG and CCC sequences. A low-end search
program can be written to mask out such "low-information"
sequences, or programs such as BLAST can ignore the low-
information sequences.
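
A simple masking program of the kind mentioned above might look like the following Python sketch. It is not the Factura software: the run-length thresholds, the use of "N" as the masking character and the function name are all assumptions chosen for illustration.

```python
import re

# Assumed thresholds: a terminal poly-A run of 10 or more bases,
# four or more consecutive GAG repeats, or a run of 10 or more C bases.
LOW_INFO = re.compile(r"A{10,}$|(?:GAG){4,}|C{10,}")

def mask_low_information(seq):
    """Replace low-information stretches with 'N' so search tools can ignore them."""
    return LOW_INFO.sub(lambda m: "N" * len(m.group()), seq.upper())

print(mask_low_information("acgtgcatt" + "a" * 15))
```

Masking rather than deleting keeps the original coordinates of the remaining bases intact, which simplifies reporting where a match occurred.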

In the algorithm implemented by the INHERIT 670
Sequence Analysis System, the Pattern Specification
Language (developed by TRW Inc.) is used to determine
regions of homology. "There are three parameters that
determine how INHERIT analysis runs sequence comparisons:
window size, window offset and error tolerance. Window
size specifies the length of the segments into which the
query sequence is subdivided. Window offset specifies
where to start the next segment [to be compared], counting
from the beginning of the previous segment. Error
tolerance specifies the total number of insertions,
deletions and/or substitutions that are tolerated over the
specified word length. Error tolerance may be set to any
integer between 0 and 6. The default settings are window
tolerance=20, window offset=10 and error tolerance=3."
INHERIT Analysis Users Manual, pp. 2-15, Version 1.0,
Applied Biosystems, Inc., October 1991.
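
The effect of the three parameters can be sketched in Python as follows. This is not the INHERIT software or the Pattern Specification Language. The function names are hypothetical, the quoted default "window tolerance=20" is read here as a window size of 20 (an assumption), and a plain edit distance is used as a stand-in for the count of insertions, deletions and substitutions.

```python
def query_windows(sequence, window_size=20, window_offset=10):
    """Yield (start, segment); each segment starts window_offset after the previous one."""
    last_start = max(len(sequence) - window_size, 0)
    for start in range(0, last_start + 1, window_offset):
        yield start, sequence[start:start + window_size]

def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1,             # insertion into a
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def within_tolerance(segment, candidate, error_tolerance=3):
    """True when candidate differs from segment by at most error_tolerance edits."""
    return edit_distance(segment, candidate) <= error_tolerance
```

With a window size of 20 and an offset of 10, consecutive segments overlap by half, so a homologous stretch that straddles one window boundary still falls wholly inside a neighboring window.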

Using a combination of these three parameters, a
database (such as a DNA database) can be searched for
sequences containing regions of homology and the
appropriate sequences are scored with an initial value.
Subsequently, these homologous regions are examined using
dot matrix homology plots to determine regions of homology
versus regions of repetition. Smith-Waterman alignments
can be used to display the results of the homology search.
The INHERIT software can be executed by a Sun computer
system programmed with the UNIX operating system.

Search alternatives to INHERIT include the BLAST
program, GCG (available from the Genetics Computer Group,
WI) and the Dasher program (Temple Smith, Boston
University, Boston, MA). Nucleotide sequences can be
searched against Genbank, EMBL or custom databases such as
GENESEQ (available from Intelligenetics, Mountain View, CA)
or other databases for genes. In addition, we have
searched some sequences against our own in-house database.

In preferred embodiments, the transcript sequences are
analyzed by the INHERIT software for best conformance with
a reference gene transcript, so that each is assigned a
sequence identifier and a degree of homology. Together these
constitute the identified sequence value, which is input into,
and further processed by, a Macintosh personal computer
(available from Apple) programmed with an "abundance sort and
subtraction analysis" computer program (to be described below).

Prior to processing by the abundance sort and subtraction
analysis program (also denoted as the "abundance sort" program),
identified sequences from the cDNA clones are assigned a
value (according to the parameters given above) by degree
of match according to the following categories: "exact"
matches (regions with a high degree of identity),
homologous human matches (regions of high similarity, but
not "exact" matches), homologous non-human matches (regions
of high similarity present in species other than human), or
non matches (no significant regions of homology to
previously identified nucleotide sequences stored in the
form of the database). Alternately, the degree of match
can be a numeric value as described below.

With reference again to the step of identifying
matches between reference sequences and database entries,
protein and peptide sequences can be deduced from the
nucleic acid sequences. Using the deduced polypeptide
sequence, the match identification can be performed in a
manner analogous to that done with cDNA sequences. A
protein sequence is used as a query sequence and compared
to the previously identified sequences contained in a
database such as the Swiss/Prot, PIR and the NBRF Protein
database to find homologous proteins. These proteins are
initially scored for homology using a homology score Table
(Orcutt, B.C. and Dayhoff, M.O., Scoring Matrices, PIR
Report MAT-0285 (February 1985)), resulting in an INIT
score. The homologous regions are aligned to obtain the
highest matching scores by inserting a gap which adds a
probable deleted portion. The matching score is
recalculated using the homology score Table and the
insertion score Table, resulting in an optimized (OPT)
score. Even in the absence of knowledge of the proper
reading frame of an isolated sequence, the above-described
protein homology search may be performed by searching all 3
reading frames.
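
Deducing a candidate protein query from each of the three reading frames can be sketched in Python as follows. The sketch assumes the standard genetic code and translates the given strand only; the function name and the use of "X" for codons containing ambiguous bases are illustrative choices, not part of the described system.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_three_frames(dna):
    """Return the translations of dna starting at offsets 0, 1 and 2.

    '*' marks a stop codon; 'X' marks a codon that is not in the table
    (for example, one containing an ambiguous base such as N).
    """
    dna = dna.upper()
    frames = []
    for offset in range(3):
        codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
        frames.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return frames

print(translate_three_frames("ATGGCCTAA"))
```

Each of the three resulting strings can then be submitted as a separate protein query, and the frame yielding the best-scoring match is taken as the likely reading frame.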

Peptide and protein sequence homologies can also be
ascertained using the INHERIT 670 Sequence Analysis System
in an analogous way to that used in DNA sequence
homologies. Pattern Specification Language and parameter
windows are used to search protein databases for sequences
containing regions of homology which are scored with an
initial value. Subsequent display in a dot-matrix homology
plot shows regions of homology versus regions of
repetition. Additional search tools that are available to
use on pattern search databases include PLsearch Blocks
(available from Henikoff & Henikoff, University of
Washington, Seattle), Dasher and GCG. Pattern search
databases include, but are not limited to, Protein Blocks
(available from Henikoff & Henikoff, University of
Washington, Seattle), Brookhaven Protein (available from
the Brookhaven National Laboratory, Brookhaven, MA),
PROSITE (available from Amos Bairoch, University of Geneva,
Switzerland), ProDom (available from Temple Smith, Boston
University), and PROTEIN MOTIF FINGERPRINT (available from
University of Leeds, United Kingdom).

The ABI Assembler application software, part of the
INHERIT DNA analysis system (available from Applied
Biosystems, Inc., Foster City, CA), can be employed to
create and manage sequence assembly projects by assembling
data from selected sequence fragments into a larger
sequence. The Assembler software combines two advanced
computer technologies which maximize the ability to
assemble sequenced DNA fragments into Assemblages, a
special grouping of data where the relationships between
sequences are shown by graphic overlap, alignment and
statistical views. The process is based on the
Meyers-Kececioglu model of fragment assembly (INHERIT™
Assembler User's Manual, Applied Biosystems, Inc., Foster
City, CA), and uses graph theory as the foundation of a
very rigorous multiple sequence alignment engine for
assembling DNA sequence fragments. Other assembly programs
that can be used include MEGALIGN (available from DNASTAR
Inc., Madison, WI), Dasher and STADEN (available from Roger
Staden, Cambridge, England).

Next, with reference to Fig. 2, we describe in more
detail the "abundance sort" program which implements above-
mentioned "step (b)" to tabulate the number of sequences of
the library which match each database entry (the "abundance
number" for each database entry).

Fig. 2 is a flow chart of a preferred embodiment of
the abundance sort program. A source code listing of this
embodiment of the abundance sort program is set forth in
Table 5. In the Table 5 implementation, the abundance sort
program is written using the FoxBASE programming language
commercially available from Microsoft Corporation.
Although FoxBASE was the program chosen for the first
iteration of this technology, it should not be considered
limiting. Many other programming languages, Sybase being a
particularly desirable alternative, can also be used, as
will be obvious to one with ordinary skill in the art. The
subroutine names specified in Fig. 2 correspond to
subroutines listed in Table 5.

With reference again to Fig. 2, the "Identified
Sequences" are transcript sequences representing each
sequence of the library and a corresponding identification
of the database entry (if any) which it matches. In other
words, the "Identified Sequences" are transcript sequences
representing the output of above-discussed "step (a)."

Fig. 3 is a block diagram of a system for implementing
the invention. The Fig. 3 system includes library
generation unit 2 which generates a library and asserts an
output stream of transcript sequences indicative of the
biological sequences comprising the library. Programmed
processor 4 receives the data stream output from unit 2 and
processes this data in accordance with above-discussed
"step (a)" to generate the Identified Sequences. Processor
4 can be a processor programmed with the commercially
available computer program known as the INHERIT 670
Sequence Analysis System and the commercially available
computer program known as the Factura program (both
available from Applied Biosystems Inc.) and with the UNIX
operating system.

Still with reference to Fig. 3, the Identified
Sequences are loaded into processor 6 which is programmed
with the abundance sort program. Processor 6 generates the
Final Transcript sequences indicated in both Figs. 2 and 3.
Fig. 4 shows a more detailed block diagram of a planned
relational computer system, including various searching
techniques which can be implemented, along with an
assortment of databases to query against.

With reference to Fig. 2, the abundance sort program
first performs an operation known as "Tempnum" on the
Identified Sequences, to discard all of the Identified
Sequences except those which match database entries of
selected types. For example, the Tempnum process can
select Identified Sequences which represent matches of the
following types with database entries (see above for
definition): "exact" matches, human "homologous" matches,
"other species" matches (representing genes present in
species other than human), "no" matches (no significant
regions of homology with database entries representing
previously identified nucleotide sequences), "I" matches
(Incyte for not previously known DNA sequences), or "X"
matches (matches ESTs in reference database). This
eliminates the U, S, M, V, A, R and D sequences (see Table 1
for definitions).
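
The selection performed by "Tempnum" can be pictured with the short Python sketch below. It is not the FoxBASE subroutine of Table 5. The record layout, the function name and the example labels are hypothetical; only the set of eliminated type codes (U, S, M, V, A, R and D) is taken from the description above.

```python
# Match-type codes that the description says are eliminated.
DISCARDED_TYPES = {"U", "S", "M", "V", "A", "R", "D"}

def tempnum(records):
    """Keep records whose match-type code is not in DISCARDED_TYPES.

    records -- iterable of (entry_label, match_type) pairs
    """
    return [r for r in records if r[1] not in DISCARDED_TYPES]

# Hypothetical identified sequences with their match-type codes.
identified = [("clone 101", "I"), ("vector only", "V"),
              ("EST hit", "X"), ("repeat", "R")]
print(tempnum(identified))
```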

The identified sequence values selected during the
"Tempnum" process then undergo a further selection (weeding
out) operation known as "Tempred." This operation can, for
example, discard all identified sequence values
representing matches with selected database entries.

The identified sequence values selected during the
"Tempred" process are then classified according to library,
during the "Tempdesig" operation. It is contemplated that
the "Identified Sequences" can represent sequences from a
single library, or from two or more libraries.

Consider first the case that the identified sequence
values represent sequences from a single library. In this
case, all the identified sequence values determined during
"Tempred" undergo sorting in the "Templib" operation,
further sorting in the "Libsort" operation, and finally
additional sorting in the "Temptarsort" operation. For
example, these three sorting operations can sort the
identified sequences in order of decreasing "abundance
number" (to generate a list of decreasing abundance
numbers, each abundance number corresponding to a unique
identified sequence entry, or several lists of decreasing
abundance numbers, with the abundance numbers in each list
corresponding to database entries of a selected type) with
redundancies eliminated from each sorted list. In this
case, the operation identified as "Cruncher" can be
bypassed, so that the "Final Data" values are the organized
transcript sequences produced during the "Temptarsort"
operation.

We next consider the case that the transcript
sequences produced during the "Tempred" operation represent
sequences from two libraries (which we will denote the
"target" library and the "subtractant" library). For
example, the target library may consist of cDNA sequences
from clones of a diseased cell, while the subtractant
library may consist of cDNA sequences from clones of the
diseased cell after treatment by exposure to a drug. For
another example, the target library may consist of cDNA
sequences from clones of a cell type from a young human,
while the subtractant library may consist of cDNA sequences
from clones of the same cell type from the same human at
different ages.

In this case, the "Tempdesig" operation routes all
transcript sequences representing the target library for
processing in accordance with "Templib" (and then "Libsort"
and "Temptarsort"), and routes all transcript sequences
representing the subtractant library for processing in
accordance with "Tempsub" (and then "Subsort" and
"Tempsubsort"). For example, the consecutive "Templib,"
"Libsort," and "Temptarsort" sorting operations sort
identified sequences from the target library in order of
decreasing abundance number (to generate a list of
decreasing abundance numbers, each abundance number
corresponding to a database entry, or several lists of
decreasing abundance numbers, with the abundance numbers in
each list corresponding to database entries of a selected
type) with redundancies eliminated from each sorted list.
The consecutive "Tempsub," "Subsort," and "Tempsubsort"
sorting operations sort identified sequences from the
subtractant library in order of decreasing abundance number
(to generate a list of decreasing abundance numbers, each
abundance number corresponding to a database entry, or
several lists of decreasing abundance numbers, with the
abundance numbers in each list corresponding to database
entries of a selected type) with redundancies eliminated
from each sorted list.

The transcript sequences output from the "Temptarsort"
operation typically represent sorted lists from which a
histogram could be generated in which position along one
(e.g., horizontal) axis indicates abundance number (of
target library sequences), and position along another
(e.g., vertical) axis indicates identified sequence value
(e.g., human or non-human gene type). Similarly, the
transcript sequences output from the "Tempsubsort"
operation typically represent sorted lists from which a
histogram could be generated in which position along one
(e.g., horizontal) axis indicates abundance number (of
subtractant library sequences), and position along another
(e.g., vertical) axis indicates identified sequence value
(e.g., human or non-human gene type).

The transcript sequences (sorted lists) output from
the Tempsubsort and Temptarsort sorting operations are
combined during the operation identified as "Cruncher."
The "Cruncher" process identifies pairs of corresponding
target and subtractant abundance numbers (both representing
the same identified sequence value), and divides one by the
other to generate a "ratio" value for each pair of
corresponding abundance numbers, and then sorts the ratio
values in order of decreasing ratio value. The data output
from the "Cruncher" operation (the Final Transcript
sequence in Fig. 2) is typically a sorted list from which a
histogram could be generated in which position along one
axis indicates the size of a ratio of abundance numbers
(for corresponding identified sequence values from target
and subtractant libraries) and position along another axis
indicates identified sequence value (e.g., gene type).

Preferably, prior to obtaining a ratio between the two
library abundance values, the Cruncher operation also
divides each abundance value by the total number of sequences
in one or both of the target and subtractant libraries.
The resulting lists of "relative" ratio values generated by
the Cruncher operation are useful for many medical,
scientific, and industrial applications. Also preferably,
the output of the Cruncher operation is a set of lists,
each list representing a sequence of decreasing ratio
values for a different selected subset (e.g. protein
family) of database entries.

In one example, the abundance sort program of the
invention tabulates for a library the numbers of mRNA
transcripts corresponding to each gene identified in a
database. These numbers are divided by the total number of
clones sampled. The results of the division reflect the
relative abundance of the mRNA transcripts in the cell type
or tissue from which they were obtained. Obtaining this
final data set is referred to herein as "gene transcript
image analysis." The resulting subtracted data show
exactly what proteins and genes are upregulated and
downregulated in highly detailed complexity.

6.6. HUVEC cDNA LIBRARY
Table 2 is an abundance table listing the various gene
transcripts in an induced HUVEC library. The transcripts
are listed in order of decreasing abundance. This
computerized sorting simplifies analysis of the tissue and
speeds identification of significant new proteins which are
specific to this cell type. This type of endothelial cell
lines tissues of the cardiovascular system, and the more
that is known about its composition, particularly in
response to activation, the more choices of protein targets
become available to affect in treating disorders of this
tissue, such as the highly prevalent atherosclerosis.

6.7. MONOCYTE-CELL AND MAST-CELL cDNA LIBRARIES 
Tables 3 and 4 show truncated comparisons of two
libraries. In Tables 3 and 4 the "normal monocytes" are
the HMC-1 cells, and the "activated macrophages" are the
THP-1 cells pretreated with PMA and activated with LPS.
Table 3 lists in descending order of abundance the most
abundant gene transcripts for both cell types. With only
15 gene transcripts from each cell type, this table permits
quick, qualitative comparison of the most common
transcripts. This abundance sort, with its convenient
side-by-side display, provides an immediately useful
research tool. In this example, this research tool
discloses that 1) only one of the top 15 activated
macrophage transcripts is found in the top 15 normal
monocyte gene transcripts (poly A binding protein); and 2)
a new gene transcript (previously unreported in other
databases) is relatively highly represented in activated
macrophages but is not similarly prominent in normal
macrophages. Such a research tool provides researchers
with a short-cut to new proteins, such as receptors, cell-
surface and intracellular signalling molecules, which can
serve as drug targets in commercial drug screening
programs. Such a tool could save considerable time over
that consumed by a hit and miss discovery program aimed at
identifying important proteins in and around cells, because
those proteins carrying out everyday cellular functions and
represented as steady state mRNA are quickly eliminated
from further characterization.

This illustrates how the gene transcript profiles
change with altered cellular function. Those skilled in
the art know that the biochemical composition of cells also
changes with other functional changes such as cancer,
including cancer's various stages, and exposure to
toxicity. A gene transcript subtraction profile such as in
Table 3 is useful as a first screening tool for such gene
expression and protein studies.

6.8. SUBTRACTION ANALYSIS OF NORMAL MONOCYTE-CELL AND
ACTIVATED MONOCYTE CELL cDNA LIBRARIES

Once the cDNA data were in the computer, the computer
program as disclosed in Table 5 was used to obtain ratios
of all the gene transcripts in the two libraries discussed
in Example 6.7, and the gene transcripts were sorted by the
descending values of their ratios. If a gene transcript is
not represented in one library, that gene transcript's
abundance is unknown but appears to be less than 1. As an
approximation (and to obtain a ratio, which would not be
possible if the unrepresented gene were given an abundance
of zero), genes which are represented in only one of the
two libraries are assigned an abundance of 1/2. Using 1/2
for unrepresented clones increases the relative importance
of "turned-on" and "turned-off" genes, whose products would
be drug candidates. The resulting print-out is called a
subtraction table and is an extremely valuable screening
method, as is shown by the following data.
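
The ratio calculation with the 1/2 substitution can be sketched in Python as follows. This is an illustration and not the Table 5 program: the function name, the dictionary inputs and the gene labels are hypothetical. Only the rule of assigning an abundance of 1/2 to a transcript absent from one library, and the sort by descending ratio, are taken from the text.

```python
def subtraction_table(target, subtractant, missing=0.5):
    """Build (entry, target_abundance, subtractant_abundance, ratio) rows.

    target, subtractant -- dicts mapping an entry label to its abundance number.
    A transcript absent from one library is given the abundance `missing`
    so that a ratio can still be formed.
    """
    rows = []
    for entry in set(target) | set(subtractant):
        t = target.get(entry, 0) or missing
        s = subtractant.get(entry, 0) or missing
        rows.append((entry, t, s, t / s))
    # Descending ratio; ties are broken by label for a reproducible listing.
    rows.sort(key=lambda r: (-r[3], r[0]))
    return rows

# Hypothetical abundance numbers for two libraries.
activated = {"g1": 10, "g2": 2, "g3": 1}
normal = {"g2": 4, "g3": 1, "g4": 3}
for row in subtraction_table(activated, normal):
    print(row)
```

In this made-up example the transcript found only in the target library rises to the top of the table and the one found only in the subtractant library falls to the bottom, which is the emphasis on "turned-on" and "turned-off" genes described above.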

Table 4 is a subtraction table, in which the normal
monocyte library was electronically "subtracted" from the
activated macrophage library. This table highlights most
effectively the changes in abundance of the gene
transcripts by activation of macrophages. Even among the
first 20 gene transcripts listed, there are several unknown
gene transcripts. Thus, electronic subtraction is a useful
tool with which to assist researchers in identifying much
more quickly the basic biochemical changes between two cell
types. Such a tool can save universities and
pharmaceutical companies which spend billions of dollars on
research valuable time and laboratory resources at the
early discovery stage and can speed up the drug development
cycle, which in turn permits researchers to set up drug
screening programs much earlier. Thus, this research tool
provides a way to get new drugs to the public faster and
more economically.

Also, such a subtraction table can be obtained for
patient diagnosis. An individual patient sample (such as
monocytes obtained from a biopsy or blood sample) can be
compared with data provided herein to diagnose conditions
associated with macrophage activation.

Table 4 uncovered many new gene transcripts (labeled
Incyte clones). Note that many genes are turned on in the
activated macrophage (i.e., the monocyte had a 0 in the
bgfreq column). This screening method is superior to other
screening techniques, such as the western blot, which are
incapable of uncovering such a multitude of discrete new
gene transcripts.

The subtraction-screening technique has also uncovered
a high number of cancer gene transcripts (oncogenes rho,
ETS2, rab-2 ras, YPT1-related, and acute myeloid leukemia
mRNA) in the activated macrophage. These transcripts may
be attributed to the use of immortalized cell lines and are
inherently interesting for that reason. This screening
technique offers a detailed picture of upregulated
transcripts including oncogenes, which helps explain why
anti-cancer drugs interfere with the patient's immunity
mediated by activated macrophages. Armed with knowledge
gained from this screening method, those skilled in the art
can set up more targeted, more effective drug screening
programs to identify drugs which are differentially
effective against 1) both relevant cancers and activated
macrophage conditions with the same gene transcript
profile; 2) cancer alone; and 3) activated macrophage
conditions.

Smooth muscle senescent protein (22 kd) was
upregulated in the activated macrophage, which indicates
that it is a candidate to block in controlling
inflammation.

6.9. SUBTRACTION ANALYSIS OF NORMAL LIVER CELLS AND
HEPATITIS INFECTED LIVER CELL cDNA LIBRARIES

In this example, rats are exposed to hepatitis virus
and maintained in the colony until they show definite signs
of hepatitis. Of the rats diagnosed with hepatitis, one
half of the rats are treated with a new anti-hepatitis
agent (AHA). Liver samples are obtained from all rats
before exposure to the hepatitis virus and at the end of
AHA treatment or no treatment. In addition, liver samples
can be obtained from rats with hepatitis just prior to AHA
treatment.

The liver tissue is treated as described in Examples
6.2 and 6.3 to obtain mRNA and subsequently to sequence
cDNA. The cDNA from each sample are processed and analyzed
for abundance according to the computer program in Table 5.
The resulting gene transcript images of the cDNA provide
detailed pictures of the baseline (control) for each animal
and of the infected and/or treated state of the animals.
cDNA data for a group of samples can be combined into a
group summary gene transcript profile for all control
samples, all samples from infected rats and all samples
from AHA-treated rats.
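
The abundance tally and the group summary profile described above can be illustrated with a short Python sketch. It assumes that each sequenced clone has already been matched to a database entry name; the function names and the sample data are hypothetical and are not taken from Table 5.

```python
from collections import Counter


def transcript_profile(clone_entries):
    """Count how many sequenced clones matched each database entry."""
    return Counter(clone_entries)


def group_summary(profiles):
    """Combine per-sample profiles into one group profile by summing counts."""
    total = Counter()
    for profile in profiles:
        total.update(profile)
    return total


# Hypothetical clone-to-entry matches for two control animals.
control_1 = transcript_profile(["HSRPL41", "HSFIB1", "HSRPL41"])
control_2 = transcript_profile(["HSFIB1", "HSACTCGR"])
controls = group_summary([control_1, control_2])
print(controls.most_common())
```

Summing keeps the raw clone counts; dividing each count by the number of samples would give the pooled average mentioned for the control group.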

Subtractions are performed between appropriate
individual libraries and the grouped libraries. For
individual animals, control and post-study samples can be
subtracted. Also, if samples are obtained before and after
AHA treatment, that data from individual animals and
treatment groups can be subtracted. In addition, the data
for all control samples can be pooled and averaged. The
control average can be subtracted from averages of both
post-study AHA and post-study non-AHA cDNA samples. If
pre- and post-treatment samples are available, pre- and
post-treatment samples can be compared individually (or
electronically averaged) and subtracted.

These subtraction tables are used in two general ways.
First, the differences are analyzed for gene transcripts
which are associated with continuing hepatic deterioration
or healing. The subtraction tables are tools to isolate
the effects of the drug treatment from the underlying basic
pathology of hepatitis. Because hepatitis affects many
parameters, additional liver toxicity has been difficult to
detect with only blood tests for the usual enzymes. The
gene transcript profile and subtraction provide a much
more complex biochemical picture which researchers have
needed to analyze such difficult problems.

Second, the subtraction tables provide a tool for
identifying clinical markers, individual proteins or other
biochemical determinants which are used to predict and/or
evaluate a clinical endpoint, such as disease, improvement
due to the drug, and even additional pathology due to the
drug. The subtraction tables specifically highlight genes
which are turned on or off. Thus, the subtraction tables
provide a first screen for a set of gene transcript
candidates for use as clinical markers. Subsequently,
electronic subtractions of additional cell and tissue
libraries reveal which of the potential markers are in fact
found in different cell and tissue libraries. Candidate
gene transcripts found in additional libraries are removed
from the set of potential clinical markers. Then, tests of
blood or other relevant samples which are known to lack the
relevant condition and samples which are known to have it
are compared to validate the selection of the clinical
marker. In this method, the particular physiologic function
of the protein transcript need not be determined to qualify
the gene transcript as a clinical marker.
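
The two-step marker screen described above (keep genes that are turned on or off, then drop any that also appear in other libraries) can be sketched as follows. The function name, the row layout and the example identifiers are assumptions made for illustration.

```python
def candidate_markers(subtraction_rows, other_libraries):
    """Return candidate clinical markers as a sorted list.

    subtraction_rows: iterable of (gene, background_count, target_count).
    other_libraries: iterable of sets of gene identifiers found in
    additional cell and tissue libraries.
    """
    # A gene is "turned on" or "turned off" when exactly one count is zero.
    switched = {
        gene
        for gene, background, target in subtraction_rows
        if (background == 0) != (target == 0)
    }
    # Candidates seen in any additional library are removed.
    seen_elsewhere = set().union(*other_libraries)
    return sorted(switched - seen_elsewhere)


rows = [("A", 0, 5), ("B", 3, 0), ("C", 2, 4), ("D", 0, 7)]
print(candidate_markers(rows, [{"D"}, {"X"}]))
```

The remaining candidates would still need validation against samples known to have and to lack the condition, which this sketch does not attempt.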

6.10. ELECTRONIC NORTHERN BLOT

One limitation of electronic subtraction is that it is
difficult to compare more than a pair of images at once.
Once particular individual gene products are identified as
relevant to further study (via electronic subtraction or
other methods), it is useful to study the expression of
single genes in a multitude of different tissues. In the
lab, the technique of "Northern" blot hybridization is used
for this purpose. In this technique, a single cDNA, or a
probe corresponding thereto, is labeled and then hybridized
against a blot containing RNA samples prepared from a
multitude of tissues or cell types. Upon autoradiography,
the pattern of expression of that particular gene, one at a
time, can be quantitated in all the included samples.

In contrast, a further embodiment of this invention is
the computerized form of this process, termed here
"electronic northern blot." In this variation, a single
gene is queried for expression against a multitude of
prepared and sequenced libraries present within the
database. In this way, the pattern of expression of any
single candidate gene can be examined instantaneously and
effortlessly. More candidate genes can thus be scanned,
leading to more frequent and fruitfully relevant
discoveries. The computer program included as Table 5
includes a program for performing this function, and Table
6 is a partial listing of entries of the database used in
the electronic northern blot analysis.
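
The single-gene query behind the electronic northern blot can be sketched in a few lines of Python. This is an illustrative assumption about how such a lookup might be organized, not the routine of Table 5; the function name is hypothetical, and the library names and counts are taken from Table 4.

```python
def electronic_northern(gene, libraries):
    """Return the clone count of one gene in every library.

    libraries maps a library name to a profile, i.e. a mapping from
    gene identifier to clone count. Libraries in which the gene was
    not found report a count of 0.
    """
    return {name: profile.get(gene, 0) for name, profile in libraries.items()}


libraries = {
    "THP-1": {"HUMIL1": 131, "HUMMIP1A": 121},
    "HMC": {"HUMMIP1A": 3},
}
print(electronic_northern("HUMIL1", libraries))
print(electronic_northern("HUMMIP1A", libraries))
```

Because the query is a dictionary lookup per library, adding further libraries to the database extends the "blot" without any change to the query itself.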

6.11. PHASE I CLINICAL TRIALS

Based on the establishment of safety and effectiveness
in the above animal tests, Phase I clinical tests are
undertaken. Normal patients are subjected to the usual
preliminary clinical laboratory tests. In addition,
appropriate specimens are taken and subjected to gene
transcript analysis. Additional patient specimens are
taken at predetermined intervals during the test. The
specimens are subjected to gene transcript analysis as
described above. In addition, the gene transcript changes
noted in the earlier rat toxicity study are carefully
evaluated as clinical markers in the followed patients.
Changes in the gene transcript analyses are evaluated as
indicators of toxicity by correlation with clinical signs
and symptoms and other laboratory results. In addition,
subtraction is performed on individual patient specimens
and on averaged patient specimens. The subtraction
analysis highlights any toxicological changes in the
treated patients. This is a highly refined determinant of
toxicity. The subtraction method also annotates clinical
markers. Further subgroups can be analyzed by subtraction
analysis, including, for example, 1) segregation by
occurrence and type of adverse effect; and 2) segregation
by dosage.

6.12. GENE TRANSCRIPT IMAGING ANALYSIS IN CLINICAL STUDIES

A gene transcript imaging analysis (or multiple gene
transcript imaging analyses) is a useful tool in other
clinical studies. For example, the differences in gene
transcript imaging analyses before and after treatment can
be assessed for patients on placebo and drug treatment.
This method also effectively screens for clinical markers
to follow in clinical use of the drug.

6.13. COMPARATIVE GENE TRANSCRIPT ANALYSIS BETWEEN SPECIES

The subtraction method can be used to screen cDNA
libraries from diverse sources. For example, the same cell
types from different species can be compared by gene
transcript analysis to screen for specific differences,
such as in detoxification enzyme systems. Such testing
aids in the selection and validation of an animal model for
the commercial purpose of drug screening or toxicological
testing of drugs intended for human or animal use. When
the comparison between animals of different species is
shown in columns for each species, we refer to this as an
interspecies comparison, or zoo blot.

Embodiments of this invention may employ databases
such as those written using the FoxBASE programming
language commercially available from Microsoft Corporation.
Other embodiments of the invention employ other databases,
such as a random peptide database, a polymer database, a
synthetic oligomer database, or an oligonucleotide database
of the type described in U.S. Patent 5,270,170, issued
December 14, 1993 to Cull, et al., PCT International
Application Publication No. WO 9322684, published November
11, 1993, PCT International Application Publication No. WO
9306121, published April 1, 1993, or PCT International
Application Publication No. WO 9119818, published December
26, 1991. These four references (whose text is
incorporated herein by reference) include teaching which
may be applied in implementing such other embodiments of
the present invention.

All references referred to in the preceding text are
hereby expressly incorporated by reference herein.

Various modifications and variations of the described
method and system of the invention will be apparent to
those skilled in the art without departing from the scope
and spirit of the invention. Although the invention has
been described in connection with specific preferred
embodiments, it should be understood that the invention as
claimed should not be unduly limited to such specific
embodiments.

TABLE 2



Clone numbers 15000 through 20000
Libraries: HUVEC
Arranged by ABUNDANCE
Total clones analyzed: 5000
319 genes, for a total of 1713 clones

     number        N  c  entry      s  descriptor
  1  15365        67     HSRPL41        Riboptn L41
  2  15004        65     NCY015004      INCYTE 015004
  3  15638        63     NCY015638      INCYTE 015638
  4  15390        50     NCY015390      INCYTE 015390
  5  15193        47     HSFIB1         Fibronectin
  6  15220        47     RRRPL9         Riboptn L9
  7  15280        47     NCY015280      INCYTE 015280
  8  15583        33     M62060         EST HHCH09 (IGR)
  9  15662        31     HSACTCGR       Actin, gamma
 10  15026        29     NCY015026      INCYTE 015026
 11  15279        24     HSEF1AR        Elf 1-alpha
 12  15027        23     NCY015027      INCYTE 015027
 13  15033        20     NCY015033      INCYTE 015033
 14  15198        20     NCY015198      INCYTE 015198
 15  15809        20     HSCOLL1        Collagenase
 16  15221        19     NCY015221      INCYTE 015221
 17  15263        19     NCY015263      INCYTE 015263
 18  15290        19     NCY015290      INCYTE 015290
 19  15350        18     NCY015350      INCYTE 015350
 20  15030        17     NCY015030      INCYTE 015030
 21  15234        17     NCY015234      INCYTE 015234
 22  15459        16     NCY015459      INCYTE 015459
 23  15353        15     NCY015353      INCYTE 015353
 24  15378        15     S76965         Ptn kinase inhib
 25  15255        14     HUMTHYB4       Thymosin beta-4
 26  15401        14     HSLIPCR        Lipocortin I
 27  15425        14     HSPOLYAB       Poly-A bp
 28  18212        14     HUMTHYMA       Thymosin, alpha
 29  [illegible]  14     HSMRP1         Motility relat ptn; MRP-1; CD-9
 30  15189        13     HS18D          Interferon induc ptn 1-8D
 31  [illegible]  12     HUMFKBP        FK506 bp
 32  [illegible]  12     HSH2AZ         Histone H2A
 33  [illegible]  12     HUMLEC         Lectin, B-galbp, 14kDa
 34  15789        11     NCY015789      INCYTE 015789
 35  16578        11     HSRPS11        Riboptn S11
 36  16632        11     M61984         EST HHCA13 (IGR)
 37  18314        11     NCY018314      INCYTE 018314
 38  15367        10     NCY015367      INCYTE 015367
 39  15415        10     HSIFNIN1       Interferon induc mRNA
 40  15633        10     HSLDHAR        Lactate dehydrogenase
 41  15813        10     CHKNMHCB   C   Myosin heavy chain B
 42  18210        10     NCY018210      INCYTE 018210
 43  18233        10     HSRPII140      RNA polymerase II
 44  18996        10     NCY018996      INCYTE 018996
 45  15088         9     HUMFERL        Ferritin, light chain
 46  15714         9     NCY015714      INCYTE 015714
 47  15720         9     NCY015720      INCYTE 015720
 48  15863         9     NCY015863      INCYTE 015863
 49  16121         9     HSET           Endothelin
 50  18252         9     NCY018252      INCYTE 018252
 51  15351         8     HUMALBP        Lipid bp, adipocyte
 52  15370         8     NCY015370      INCYTE 015370

TABLE 2 Con't





     number   N  entry      s  descriptor
 53  15670    8  BTCIASHI   V  NADH-ubiq oxidoreductase
 54  15795    8  NCY015795     INCYTE 015795
 55  16245    8  NCY016245     INCYTE 016245
 56  18262    8  NCY018262     INCYTE 018262
 57  18321    8  HSRPL17       Riboptn L17
 58  15126    7  XLRPL1BR   F  Riboptn L1
 59  15133    7  HSAC07        Actin, beta
 60  15245    7  NCY015245     INCYTE 015245
 61  15288    7  NCY015288     INCYTE 015288
 62  15294    7  HSGAPDR       G-3-PD
 63  15442    7  HUMLAMB       Laminin receptor, 54kDa
 64  15485    7  HSNGMRNA      Uracil DNA glycosylase
 65  16646    7  NCY016646     INCYTE 016646
 66  18003    7  HUMPAIA       Plsmnogen activ gene
 67  15032    6  HUMUB         Ubiquitin
 68  15267    6  HSRPS8        Riboptn S8
 69  15295    6  NCY015295     INCYTE 015295
 70  15458    6  RNRPS10R   R  Riboptn S10
 71  15832    6  RSGALEM    R  UDP-galactose epimerase
 72  15928    6  HUMAPOJ       Apolipoptn J
 73  16598    6  HUMTBBM40     Tubulin, beta
 74  18218    6  NCY018218     INCYTE 018218
 75  18499    6  HSP27         Hydrophobic ptn p27
 76  18963    6  NCY018963     INCYTE 018963
 77  18997    6  NCY018997     INCYTE 018997
 78  15432    5  HSAGALAR      Galactosidase A, alpha
 79  15475    5  NCY015475     INCYTE 015475
 80  15721    5  NCY015721     INCYTE 015721
 81  15865    5  NCY015865     INCYTE 015865
 82  16270    5  NCY016270     INCYTE 016270
 83  16886    5  NCY016886     INCYTE 016886
 84  18500    5  NCY018500     INCYTE 018500
 85  18503    5  NCY018503     INCYTE 018503
 86  19672    5  RRRPL34    R  Riboptn L34
 87  15086    4  XLRPL1AR   F  Riboptn L1a
 88  15113    4  HUMIFNWRS     tRNA synthetase, trp
 89  15242    4  NCY015242     INCYTE 015242
 90  15249    4  NCY015249     INCYTE 015249
 91  15377    4  NCY015377     INCYTE 015377
 92  15407    4  NCY015407     INCYTE 015407
 93  15473    4  NCY015473     INCYTE 015473
 94  15588    4  HSRPS12       Riboptn S12
 95  15684    4  HSEF1G        Elf 1-gamma
 96  15782    4  NCY015782     INCYTE 015782
 97  15916    4  HSRPS18       Riboptn S18
 98  15930    4  NCY015930     INCYTE 015930
 99  16108    4  NCY016108     INCYTE 016108
100  16133    4  NCY016133     INCYTE 016133

TABLE 4



Libraries: THP-1
Subtracting: HMC
Sorted by ABUNDANCE
Total clones analyzed: 7375
1057 genes, for a total of 2151 clones

number  entry      s  descriptor                   bgfreq  rfend  ratio
10022   HUMIL1        IL 1-beta                         0    131  262.00
10036   HSMDNCF       IL-8                              0    119  238.00
10089   HSLAG1CDN     Lymphocyte activ gene             0     71  142.00
10060   HUMTCSM       RANTES                            0     23  46.000
10003   HUMMIP1A      MIP-1                             3    121  40.333
10689   HSOP          Osteopontin                       0     20  40.000
11050   NCY011050     INCYTE 011050                     0     17  34.000
10937   HSTNFR        TNF-alpha                         0     17  34.000
10176   HSSOD         Superoxide dismutase              0     14  28.000
10886   HSCDW40       B-cell activ, NGF-relat           0     10  20.000
10186   HUMAPR        Early resp PMA-induc              0      9  18.000
10967   HUMGDN        PN-1, glial-deriv                 0      9  18.000
11353   NCY011353     INCYTE 011353                     0      8  16.000
10298   NCY010298     INCYTE 010298                     0      7  14.000
10215   HUM4COLA      Collagenase, type IV              0      6  12.000
10276   NCY010276     INCYTE 010276                     0      6  12.000
10488   NCY010488     INCYTE 010488                     0      6  12.000
11138   NCY011138     INCYTE 011138                     0      6  12.000
10037   HUMCAPPRO     Adenylate cyclase                 1     10  10.000
10840   HUMADCY       Adenylate cyclase                 0      5  10.000
10672   HSCD44E       Cell adhesion glptn               0      5  10.000
12837   HUMCYCLOX     Cyclooxygenase-2                  0      5  10.000
10001   NCY010001     INCYTE 010001                     0      5  10.000
10005   NCY010005     INCYTE 010005                     0      5  10.000
10294   NCY010294     INCYTE 010294                     0      5  10.000
10297   NCY010297     INCYTE 010297                     0      5  10.000
10403   NCY010403     INCYTE 010403                     0      5  10.000
10699   NCY010699     INCYTE 010699                     0      5  10.000
10966   NCY010966     INCYTE 010966                     0      5  10.000
12092   NCY012092     INCYTE 012092                     0      5  10.000
12549   HSRHOB        Oncogene rho                      0      5  10.000
10691   HUMARF1BA     ADP-ribosylation fctr             0      4  8.000
12106   HSADSS        Adenylosuccinate synthetase       0      4  8.000
10194   HSCATHL       Cathepsin L                       0      4  8.000
10479   CLMCYCA    I  Cyclin A                          0      4  8.000
10031   NCY010031     INCYTE 010031                     0      4  8.000
10203   NCY010203     INCYTE 010203                     0      4  8.000
10288   NCY010288     INCYTE 010288                     0      4  8.000
10372   NCY010372     INCYTE 010372                     0      4  8.000
10471   NCY010471     INCYTE 010471                     0      4  8.000
10484   NCY010484     INCYTE 010484                     0      4  8.000
10859   NCY010859     INCYTE 010859                     0      4  8.000
10890   NCY010890     INCYTE 010890                     0      4  8.000
11511   NCY011511     INCYTE 011511                     0      4  8.000
11868   NCY011868     INCYTE 011868                     0      4  8.000
12820   NCY012820     INCYTE 012820                     0      4  8.000
10133   HSI1RAP       IL-1 antagonist                   0      4  8.000
10516   HUMP2A        Phosphatase, regul 2A             0      4  8.000
11063   HUMB94        TNF-induc response                0      4  8.000
11140   HSHB15RNA     HB15 gene; new Ig                 0      3  6.000
10788   NCY001713     INCYTE 001713                     0      3  6.000
10033   NCY010033     INCYTE 010033                     0      3  6.000
10035   NCY010035     INCYTE 010035                     0      3  6.000
10084   NCY010084     INCYTE 010084                     0      3  6.000
10236   NCY010236     INCYTE 010236                     0      3  6.000
10383   NCY010383     INCYTE 010383                     0      3  6.000

TABLE 4 Con't



number  entry      s  descriptor                   bgfreq  rfend  ratio
10450   NCY010450     INCYTE 010450                     0      3  6.000
10470   NCY010470     INCYTE 010470                     0      3  6.000
10504   NCY010504     INCYTE 010504                     0      3  6.000
10507   NCY010507     INCYTE 010507                     0      3  6.000
10598   NCY010598     INCYTE 010598                     0      3  6.000
10779   NCY010779     INCYTE 010779                     0      3  6.000
10909   NCY010909     INCYTE 010909                     0      3  6.000
10976   NCY010976     INCYTE 010976                     0      3  6.000
10985   NCY010985     INCYTE 010985                     0      3  6.000
11052   NCY011052     INCYTE 011052                     0      3  6.000
11068   NCY011068     INCYTE 011068                     0      3  6.000
11134   NCY011134     INCYTE 011134                     0      3  6.000
11136   NCY011136     INCYTE 011136                     0      3  6.000
11191   NCY011191     INCYTE 011191                     0      3  6.000
11219   NCY011219     INCYTE 011219                     0      3  6.000
11386   NCY011386     INCYTE 011386                     0      3  6.000
11403   NCY011403     INCYTE 011403                     0      3  6.000
11460   NCY011460     INCYTE 011460                     0      3  6.000
11618   NCY011618     INCYTE 011618                     0      3  6.000
11686   NCY011686     INCYTE 011686                     0      3  6.000
12021   NCY012021     INCYTE 012021                     0      3  6.000
12025   NCY012025     INCYTE 012025                     0      3  6.000
12320   NCY012320     INCYTE 012320                     0      3  6.000
12330   NCY012330     INCYTE 012330                     0      3  6.000
12853   NCY012853     INCYTE 012853                     0      3  6.000
14386   NCY014386     INCYTE 014386                     0      3  6.000
14391   NCY014391     INCYTE 014391                     0      3  6.000

TABLE 5



* Master menu for SUBTRACTION output
SET TALK OFF
SET SAFETY OFF
SET EXACT ON
SET TYPEAHEAD TO 0
CLEAR
SET DEVICE TO SCREEN
USE "SmartGuy:FoxBASE+/Mac:fox files:Clones.dbf"
* Default clone range: first and last record of the clone database
GO TOP
STORE NUMBER TO INITIATE
GO BOTTOM
STORE NUMBER TO TERMINATE
* Target (query) and Object (background) library names
STORE ' ' TO Target1
STORE ' ' TO Target2
STORE ' ' TO Target3
STORE ' ' TO Object1
STORE ' ' TO Object2
STORE ' ' TO Object3
STORE 0 TO ANAL
STORE 0 TO EMATCH
STORE 0 TO HMATCH
STORE 0 TO OMATCH
STORE 0 TO IMATCH
STORE 0 TO PTF
STORE 1 TO BAIL
DO WHILE .T.

* Program.: Subtraction 2.fmt
* Date....: 10/11/94
* Version.: FoxBASE+/Mac, revision 1.10
* Notes...: Format file Subtraction 2
*
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",9 COLOR 0,0,0,
@ PIXELS 75,120 TO 178,241 STYLE 3871 COLOR 0,0,-1,24610,-1,6947
@ PIXELS 27,134 SAY "Subtraction Menu" STYLE 65536 FONT "Geneva",274 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 117,126 GET EMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Exact" SIZE 15,62 CO
@ PIXELS 135,126 GET HMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Homologous" SIZE 15,1
@ PIXELS 153,126 GET OMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Other spc" SIZE 15,84
@ PIXELS 90,152 SAY "Matches:" STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 171,126 GET IMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Incyte" SIZE 15,65 CO
@ PIXELS 252,137 GET initiate STYLE 0 FONT "Geneva",12 SIZE 15,70 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 252,236 GET terminate STYLE 0 FONT "Geneva",12 SIZE 15,70 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 252,35 SAY "Include clones" STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 252,215 SAY "->" STYLE 65536 FONT "Geneva",14 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 198,126 GET PTF STYLE 65536 FONT "Chicago",12 PICTURE "@*C Print to file" SIZE 15,9
@ PIXELS 90,9 TO 151,109 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 90,288 TO 191,397 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 81,296 SAY "Background:" STYLE 65536 FONT "Geneva",270 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 45,135 GET ANAL STYLE 65536 FONT "Chicago",12 PICTURE "@*R Overall;Function" SIZE 4
@ PIXELS 81,56 SAY "Target:" STYLE 65536 FONT "Geneva",270 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 108,20 GET target1 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 135,20 GET target2 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 162,20 GET target3 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 108,299 GET object1 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 135,299 GET object2 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 162,299 GET object3 STYLE 0 FONT "Geneva",9 SIZE 12,79 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 276,324 GET Bail STYLE 65536 FONT "Chicago",12 PICTURE "@*R Run;Bail out" SIZE 4112
*
* EOF: Subtraction 2.fmt
READ

* Second button chosen: restore the clone database and leave
IF Bail=2
   CLEAR
   CLOSE DATABASES
   USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
   SET SAFETY ON
   SCREEN 1 OFF
   RETURN
ENDIF

STORE VAL(SYS(2)) TO STARTIME
STORE UPPER(Target1) TO Target1
STORE UPPER(Target2) TO Target2
STORE UPPER(Target3) TO Target3
STORE UPPER(Object1) TO Object1
STORE UPPER(Object2) TO Object2
STORE UPPER(Object3) TO Object3
CLEAR
SET TALK ON
* Copy the requested range of clones into a working file
GAP = TERMINATE-INITIATE+1
GO INITIATE
COPY NEXT GAP FIELDS NUMBER,library,D,F,Z,R,ENTRY,S,DESCRIPTOR,START,RFEND,I TO TEMPNUM
USE TEMPNUM
COUNT TO TOT
COPY TO TEMPRED FOR D='E'.OR.D='O'.OR.D='H'.OR.D='N'.OR.D='I'
USE TEMPRED
* Keep only the match designations selected on the menu
IF Ematch=0 .AND. Hmatch=0 .AND. Omatch=0 .AND. IMATCH=0
   COPY TO TEMPDESIG
ELSE
   COPY STRUCTURE TO TEMPDESIG
   USE TEMPDESIG
   IF Ematch=1
      APPEND FROM TEMPNUM FOR D='E'
   ENDIF
   IF Hmatch=1
      APPEND FROM TEMPNUM FOR D='H'
   ENDIF
   IF Omatch=1
      APPEND FROM TEMPNUM FOR D='O'
   ENDIF
   IF Imatch=1
      APPEND FROM TEMPNUM FOR D='I'.OR.D='X'
      *.OR.D='N'
   ENDIF
ENDIF
COUNT TO STARTOT
* Build the target (query) library file
COPY STRUCTURE TO TEMPLIB
USE TEMPLIB
APPEND FROM TEMPDESIG FOR library=UPPER(target1)
IF target2<>' '
   APPEND FROM TEMPDESIG FOR library=UPPER(target2)
ENDIF
IF target3<>' '
   APPEND FROM TEMPDESIG FOR library=UPPER(target3)
ENDIF
COUNT TO ANALTOT
* Build the object (background) library file
USE TEMPDESIG
COPY STRUCTURE TO TEMPSUB
USE TEMPSUB
APPEND FROM TEMPDESIG FOR library=UPPER(Object1)
IF target2<>' '
   APPEND FROM TEMPDESIG FOR library=UPPER(Object2)
ENDIF
IF target3<>' '
   APPEND FROM TEMPDESIG FOR library=UPPER(Object3)
ENDIF
COUNT TO SUBTRACTOT
SET TALK OFF



* COMPRESSION SUBROUTINE A
* Collapse duplicate entries; RFEND ends up holding the clone count per gene
? 'COMPRESSING QUERY LIBRARY'
USE TEMPLIB
SORT ON ENTRY,NUMBER TO LIBSORT
USE LIBSORT
COUNT TO IDGENE
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
   IF MARK1 >= IDGENE
      PACK
      COUNT TO AUNIQUE
      SW2=1
      LOOP
   ENDIF
   GO MARK1
   DUP = 1
   STORE ENTRY TO TESTA
   STORE D TO DESIGA
   SW = 0
   DO WHILE SW=0 TEST
      SKIP
      STORE ENTRY TO TESTB
      STORE D TO DESIGB
      IF TESTA = TESTB.AND.DESIGA=DESIGB
         DELETE
         DUP = DUP+1
         LOOP
      ENDIF
      GO MARK1
      REPLACE RFEND WITH DUP
      MARK1 = MARK1+DUP
      SW=1
      LOOP
   ENDDO TEST
   LOOP
ENDDO ROLL
SORT ON RFEND/D,NUMBER TO TEMPTARSORT
USE TEMPTARSORT
* REPLACE ALL START WITH RFEND/IDGENE*10000
COUNT TO TEMPTARCO

* COMPRESSION SUBROUTINE B
? 'COMPRESSING TARGET LIBRARY'
USE TEMPSUB
SORT ON ENTRY,NUMBER TO SUBSORT
USE SUBSORT
COUNT TO SUBGENE
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
   IF MARK1 >= SUBGENE
      PACK
      COUNT TO BUNIQUE
      SW2=1
      LOOP
   ENDIF
   GO MARK1
   DUP = 1
   STORE ENTRY TO TESTA
   STORE D TO DESIGA
   SW = 0
   DO WHILE SW=0 TEST
      SKIP
      STORE ENTRY TO TESTB
      STORE D TO DESIGB
      IF TESTA = TESTB.AND.DESIGA=DESIGB
         DELETE
         DUP = DUP+1
         LOOP
      ENDIF
      GO MARK1
      REPLACE RFEND WITH DUP
      MARK1 = MARK1+DUP
      SW=1
      LOOP
   ENDDO TEST
   LOOP
ENDDO ROLL
SORT ON RFEND/D,NUMBER TO TEMPSUBSORT
USE TEMPSUBSORT
* REPLACE ALL START WITH RFEND/IDGENE*10000
COUNT TO TEMPSUBCO



* FUSION ROUTINE
? 'SUBTRACTING LIBRARIES'
USE SUBTRACTION
COPY STRUCTURE TO CRUNCHER
SELECT 2
USE TEMPSUBSORT
SELECT 1
USE CRUNCHER
APPEND FROM TEMPTARSORT
COUNT TO BAILOUT
MARK = 0
DO WHILE .T.
   SELECT 1
   MARK = MARK+1
   IF MARK>BAILOUT
      EXIT
   ENDIF
   GO MARK
   STORE ENTRY TO SCANNER
   SELECT 2
   LOCATE FOR ENTRY=SCANNER
   IF FOUND()
      STORE RFEND TO BIT1
      STORE RFEND TO BIT2
   ELSE
      * Gene absent from the background library: use 1/2 for the ratio
      STORE 1/2 TO BIT1
      STORE 0 TO BIT2
   ENDIF
   SELECT 1
   REPLACE BGFREQ WITH BIT2
   REPLACE ACTUAL WITH BIT1
   LOOP
ENDDO
SELECT 1
REPLACE ALL RATIO WITH RFEND/ACTUAL
? 'DOING FINAL SORT BY RATIO'
SORT ON RATIO/D,BGFREQ/D,DESCRIPTOR TO FINAL
USE FINAL



SET TALK OFF
DO CASE
CASE PTF=0
   SET DEVICE TO PRINT
   SET PRINT ON
CASE PTF=1
   SET ALTERNATE TO "Adenoid:Patent Figures:Subtraction.txt"
   SET ALTERNATE ON
ENDCASE
STORE VAL(SYS(2)) TO FINTIME
* Run crossed midnight: add one day of seconds
IF FINTIME<STARTIME
   STORE FINTIME+86400 TO FINTIME
ENDIF
STORE FINTIME-STARTIME TO COMPSEC
STORE COMPSEC/60 TO COMPMIN



SET MARGIN TO 10
@ 1,1 SAY "Library Subtraction Analysis" STYLE 65536 FONT "Geneva",274 COLOR 0,0,0,-1,-1,
?
?
?
?
? DATE()
?? '  '
?? TIME()
? 'Clone numbers '
?? STR(INITIATE,5,0)
?? ' through '
?? STR(TERMINATE,6,0)
? 'Libraries: '
?? Target1
IF Target2<>' '
   ?? ', '
   ?? Target2
ENDIF
IF Target3<>' '
   ?? ', '
   ?? Target3
ENDIF
? 'Subtracting: '
?? Object1
IF Object2<>' '
   ?? ', '
   ?? Object2
ENDIF
IF Object3<>' '
   ?? ', '
   ?? Object3
ENDIF
? 'Designations: '
IF Ematch=0 .AND. Hmatch=0 .AND. Omatch=0 .AND. IMATCH=0
   ?? 'All'
ENDIF
IF Ematch=1
   ?? 'Exact,'
ENDIF
IF Hmatch=1
   ?? 'Human,'
ENDIF
IF Omatch=1
   ?? 'Other sp.'
ENDIF
IF Imatch=1
   ?? 'INCYTE'
ENDIF
IF ANAL=1
   ? 'Sorted by ABUNDANCE'
ENDIF
IF ANAL=2
   ? 'Arranged by FUNCTION'
ENDIF
? 'Total clones represented: '
?? STR(TOT,5,0)
? 'Total clones analyzed: '
?? STR(STARTOT,5,0)
? 'Total computation time: '
?? STR(COMPMIN,5,2)
?? ' minutes'
?
? 'd = designation  f = distribution  z = location  r = function  s = species  i = inte



SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",9 COLOR 0,0,0,
DO CASE
CASE ANAL=1
   ?? STR(AUNIQUE,4,0)
   ?? ' genes, for a total of '
   ?? STR(ANALTOT,4,0)
   ?? ' clones'
   ?
   SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
   list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I
   SET PRINT OFF
   CLOSE DATABASES
   USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"

CASE ANAL=2
* arrange/function
SET PRINT ON
SET HEADING ON

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",268 COLOR 0
?
? ' BINDING PROTEINS'
?

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Surface molecules and receptors:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='B'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Calcium-binding proteins:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='C'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Ligands and effectors:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='S'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Other binding proteins:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='I'
?

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",268 COLOR 0
? ' ONCOGENES'
?

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'General oncogenes:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='O'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'GTP-binding proteins:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='G'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Viral elements:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='V'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Kinases and Phosphatases:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='Y'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Tumor-related antigens:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='A'

?

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",268 COLOR 0
? ' PROTEIN SYNTHETIC MACHINERY PROTEINS'
?

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Transcription and Nucleic Acid-binding proteins:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BG FOR R=^«

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='T'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Ribosomal proteins:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='R'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Protein processing:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='L'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",268 COLOR 0
? ' ENZYMES'
?

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='F'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Proteases and inhibitors:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='P'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Oxidative phosphorylation:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='Z'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Sugar metabolism:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='Q'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Amino acid metabolism:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='M'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Nucleic acid metabolism:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='N'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Lipid metabolism:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='W'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Other enzymes:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='E'

?

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",268 COLOR 0
? ' '
? ' MISCELLANEOUS CATEGORIES'
?

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Stress response:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='H'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Structural:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='K'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Other clones:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='X'

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Helvetica",265 COLOR 0
? 'Clones of unknown function:'
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list OFF fields number,D,F,Z,R,ENTRY,S,DESCRIPTOR,BGFREQ,RFEND,RATIO,I FOR R='U'

ENDCASE
DO "Test print.prg"
SET PRINT OFF
SET DEVICE TO SCREEN
CLOSE DATABASES
ERASE TEMPLIB.DBF
ERASE TEMPNUM.DBF
ERASE TEMPDESIG.DBF
SET MARGIN TO 0
CLEAR
LOOP
ENDDO

* Northern (single), version 11-35-94
CLOSE DATABASES
SET TALK OFF
SET PRINT OFF
SET EXACT OFF
CLEAR
STORE ' ' TO Eobject
STORE ' ' TO Dobject
STORE 0 TO Numb
STORE 0 TO Zog
STORE 1 TO Bail
DO WHILE .T.

* Program.: Northern (single).fmt
* Date....: 8/ 8/94
* Version.: FoxBASE+/Mac, revision 1.10
* Notes...: Format file Northern (single)



SCREEN 1 TYPE 0 HEADING •Screen 1- AT 40,2 SIZE 286,492 PIXELS Fmr 10 /vnt™ v. « A 

@ PIXELS 15,81 TO 46,397 STYLE 28447 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 89,79 TO 152,422 STYLE 28447 COLOR 0,0,-1,-25600,-1,-1

© PIXELS 115,98 SAY 'Entry f BmE. 65536 FONT •Geneva ■ ! 12 COLOR o o ft i 1 , 

6 ^ELS 115,173 GET Eob^ctSTYIBO^ , « 

| PIXELS 145 f 89 BAY 'Description' STVUS 65536 F^T -G^evS VIS ^L^f 0 fi r ? # "} / "? # " 1 

2 2fi a ? » Y "Single Northern search screen- STVLB 65536 ^^w2-274 «t^« a ft 

8 PIXELS 220,162 GET Bail STYLE 65536 FONT 'Chicago- lST PICTOM^fl*R of < 

• PIXELS 175.98 SAY -Clone #:< S1YLE 65536 FOOT '^^-fl?^ ° Ut SIZE 

2 l£^f iftJPJ 3 ™ Nuin k STYLE 0 FONT -CewwrnMaS is.10 GQLOr' o'o'o W-i 
8 PIXE* 80,152 SAY -Efcter any ONE of the following,- ST^^S™^ 0 :^?;^ _ r 



* EOF: Northern (single).fmt
READ

IF Bail=2
CLEAR
SCREEN 1 OFF
RETURN
ENDIF

USE "SmartGuy:FoxBASE+/Mac:Fox files:Lookup.dbf"
SET TALK ON

IF Eobject<>' '
STORE UPPER(Eobject) TO Eobject
SET SAFETY OFF
SORT ON Entry TO "Lookup entry.dbf"
SET SAFETY ON
USE "Lookup entry.dbf"
LOCATE FOR Look=Eobject
IF .NOT. FOUND()
CLEAR
LOOP
ENDIF
BROWSE
STORE Entry TO Searchval
CLOSE DATABASES
ERASE "Lookup entry.dbf"
ENDIF

IF Dobject<>' '
SET EXACT OFF
SET SAFETY OFF
SORT ON descriptor TO "Lookup descriptor.dbf"
SET SAFETY ON
USE "Lookup descriptor.dbf"
LOCATE FOR UPPER(TRIM(descriptor))=UPPER(TRIM(Dobject))
IF .NOT. FOUND()
CLEAR
LOOP
ENDIF
BROWSE
STORE Entry TO Searchval
CLOSE DATABASES
ERASE "Lookup descriptor.dbf"
SET EXACT OFF
ENDIF

IF Numb<>0
USE "SmartGuy:FoxBASE+/Mac:Fox files:clones.dbf"
GO Numb
BROWSE
STORE Entry TO Searchval
ENDIF

CLEAR
? 'Northern analysis for entry '
?? Searchval
?
? 'Enter Y to proceed'
WAIT TO OK
CLEAR

IF UPPER(OK)<>'Y'
SCREEN 1 OFF
RETURN
ENDIF

* COMPRESSION SUBROUTINE FOR Library.dbf
? 'Compressing the Libraries file now...'
USE "SmartGuy:FoxBASE+/Mac:Fox files:libraries.dbf"
SET SAFETY OFF
SORT ON library/A TO "Compressed libraries.dbf"
* FOR entered>0
SET SAFETY ON
USE "Compressed libraries.dbf"
DELETE FOR entered=0
PACK
COUNT TO TOT
MARK1 = 1
SW2=0

DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK
LOOP
ENDIF
GO MARK1
STORE library TO TESTA
SKIP
STORE library TO TESTB
IF TESTA = TESTB
DELETE
ENDIF
MARK1 = MARK1+1
LOOP
ENDDO ROLL

* Northern analysis
CLEAR
? 'Doing the northern now...'
SET TALK ON
USE "SmartGuy:FoxBASE+/Mac:Fox files:clones.dbf"
SET SAFETY OFF
COPY TO "Hits.dbf" FOR entry=searchval
SET SAFETY ON

* MASTER ANALYSIS 3; VERSION 12-9-94
* Master menu for analysis output
CLOSE DATABASES
SET TALK OFF
SET SAFETY OFF
CLEAR
SET DEVICE TO SCREEN
SET DEFAULT TO "SmartGuy:FoxBASE+/Mac:fox files:Output programs:"
USE "SmartGuy:FoxBASE+/Mac:fox files:Clones.dbf"
GO TOP
STORE NUMBER TO INITIATE
GO BOTTOM
STORE NUMBER TO TERMINATE
STORE 0 TO ENTIRE
STORE 0 TO CONDEN
STORE 0 TO ANAL
STORE 0 TO EMATCH
STORE 0 TO HMATCH
STORE 0 TO OMATCH
STORE 0 TO IMATCH
STORE 0 TO XMATCH
STORE 0 TO PRINTON
STORE 0 TO PTF
DO WHILE .T.

* Program.: Master analysis.fmt
* Date....: 12/ 9/94
* Version.: FoxBASE+/Mac, revision 1.10
* Notes...: Format file Master analysis
*

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",9 COLOR 0,0,0,
@ PIXELS 39,255 TO 277,430 STYLE 28447 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 75,120 TO 178,241 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 27,98 SAY "Customized Output Menu" STYLE 65536 FONT "Geneva",274 COLOR 0,0,-1,-1,-1
@ PIXELS 45,54 GET conden STYLE 65536 FONT "Chicago",12 PICTURE "@*C Condensed format" SIZE
@ PIXELS 54,261 GET anal STYLE 65536 FONT "Chicago",12 PICTURE "@*RV Sort/number;Sort/entry;
@ PIXELS 117,126 GET EMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Exact" SIZE 15,62 CO
@ PIXELS 135,126 GET HMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Homologous" SIZE 15,1
@ PIXELS 153,126 GET OMATCH STYLE 65536 FONT "Chicago",12 PICTURE "@*C Other sp." SIZE 15,84
@ PIXELS 90,152 SAY "Matches:" STYLE 65536 FONT "Geneva",268 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 63,54 GET PRINTON STYLE 65536 FONT "Chicago",12 PICTURE "@*C Include clone listing"
@ PIXELS 171,126 GET Imatch STYLE 65536 FONT "Chicago",12 PICTURE "@*C Incyte" SIZE 15,65 CO
@ PIXELS 252,146 GET initiate STYLE 0 FONT "Geneva",12 SIZE 15,70 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 270,146 GET terminate STYLE 0 FONT "Geneva",12 SIZE 15,70 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 234,134 SAY "Include clones" STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 270,125 SAY "->" STYLE 65536 FONT "Geneva",14 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 198,126 GET PTF STYLE 65536 FONT "Chicago",12 PICTURE "@*C Print to file" SIZE 15,9
@ PIXELS 189,0 TO 257,120 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 209,8 SAY "Library selection" STYLE 65536 FONT "Geneva",266 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 227,18 GET ENTIRE STYLE 65536 FONT "Chicago",12 PICTURE "@*RV All;Selected" SIZE 16

* EOF: Master analysis.fmt
READ 

IF ANAL=9
CLEAR
CLOSE DATABASES
ERASE TEMPMASTER.DBF
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
SET SAFETY ON
SCREEN 1 OFF
RETURN
ENDIF
CLEAR

? INITIATE
? TERMINATE
? CONDEN
? ANAL
? Ematch
? Hmatch
? Omatch
? IMATCH
SET TALK ON

IF ENTIRE=2
USE "Unique libraries.dbf"
REPLACE ALL i WITH ' '
BROWSE FIELDS i, libname, library, total, entered AT 0,0
ENDIF

USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
*COPY TO TEMPNUM FOR NUMBER>=INITIATE .AND. NUMBER<=TERMINATE
*USE TEMPNUM
COPY STRUCTURE TO TEMPLIB
USE TEMPLIB
IF ENTIRE=1
APPEND FROM "SmartGuy:FoxBASE+/Mac:fox files:Clones.dbf"
ENDIF

IF ENTIRE=2
USE "Unique libraries.dbf"
COPY TO SELECTED FOR UPPER(i)='Y'
USE SELECTED
STORE RECCOUNT() TO STOPIT
MARK=1
DO WHILE .T.
IF MARK>STOPIT
CLEAR
EXIT
ENDIF
USE SELECTED
GO MARK
STORE library TO THISONE
? 'COPYING '
?? THISONE
USE TEMPLIB
APPEND FROM "SmartGuy:FoxBASE+/Mac:fox files:Clones.dbf" FOR library=THISONE
STORE MARK+1 TO MARK
LOOP
ENDDO
ENDIF

USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
COUNT TO STARTOT
COPY STRUCTURE TO TEMPDESIG
USE TEMPDESIG

IF Ematch=0 .AND. Hmatch=0 .AND. Omatch=0 .AND. IMATCH=0
APPEND FROM TEMPLIB
ENDIF

IF Ematch=1
APPEND FROM TEMPLIB FOR D='E'
ENDIF

IF Hmatch=1
APPEND FROM TEMPLIB FOR D='H'
ENDIF

IF Omatch=1
APPEND FROM TEMPLIB FOR D='O'
ENDIF

IF Imatch=1
APPEND FROM TEMPLIB FOR D='I' .OR. D='X' .OR. D='N'
ENDIF

IF Xmatch=1
APPEND FROM TEMPLIB FOR D='X'
ENDIF
COUNT TO ANALTOT
SET TALK OFF



DO CASE
CASE PTF=0
SET DEVICE TO PRINT
SET PRINT ON
EJECT

CASE PTF=1
SET ALTERNATE TO "Total function sort.txt"
*SET ALTERNATE TO "H and O function sort.txt"
*SET ALTERNATE TO "Shear Stress HUVEC 2:Abundance sort.txt"
*SET ALTERNATE TO "Shear Stress HUVEC 2:Abundance con.txt"
*SET ALTERNATE TO "Shear Stress HUVEC 2:Function sort.txt"
*SET ALTERNATE TO "Shear Stress HUVEC 2:Distribution sort.txt"
*SET ALTERNATE TO "Shear Stress HUVEC 1:Clone list.txt"
*SET ALTERNATE TO "Shear Stress HUVEC 2:Location sort.txt"
SET ALTERNATE ON
ENDCASE

********************

IF PRINTON=1
@ 1,30 SAY "Database Subset Analysis" STYLE 65536 FONT "Geneva",274 COLOR 0,0,0,-1,-1,-1
ENDIF

?
?
?
? DATE()
?? ' '
?? TIME()
? 'Clone numbers '
?? STR(INITIATE,6,0)
?? ' through '
?? STR(TERMINATE,6,0)
? 'Libraries: '

IF ENTIRE=1
? 'All libraries'
ENDIF

IF ENTIRE=2
MARK=1
DO WHILE .T.
IF MARK>STOPIT
EXIT
ENDIF
USE SELECTED
GO MARK
? ' '
?? TRIM(libname)
STORE MARK+1 TO MARK
LOOP
ENDDO
ENDIF

? 'Designations: '

IF Ematch=0 .AND. Hmatch=0 .AND. Omatch=0 .AND. IMATCH=0
?? 'All'
ENDIF

IF Ematch=1
?? 'Exact, '
ENDIF

IF Hmatch=1
?? 'Human, '
ENDIF

IF Omatch=1
?? 'Other sp. '
ENDIF

IF Imatch=1
?? ' INCYTE'
ENDIF

IF Xmatch=1
?? 'EST '
ENDIF

IF CONDEN=1
? 'Condensed format analysis'
ENDIF

IF ANAL=1
? 'Sorted by NUMBER'
ENDIF

IF ANAL=2
? 'Sorted by ENTRY'
ENDIF

IF ANAL=3
? 'Arranged by ABUNDANCE'
ENDIF

IF ANAL=4
? 'Sorted by INTEREST'
ENDIF

IF ANAL=5
? 'Arranged by LOCATION'
ENDIF

IF ANAL=6
? 'Arranged by DISTRIBUTION'
ENDIF

IF ANAL=7
? 'Arranged by FUNCTION'
ENDIF

? 'Total clones represented: '
?? STR(STARTOT,6,0)
? 'Total clones analyzed: '
?? STR(ANALTOT,6,0)
?
? 'l = library d = designation f = distribution z = location r = function c = cer
?

USE TEMPDESIG

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
DO CASE
CASE ANAL=1
* sort/number
SET HEADING ON
IF CONDEN=1
SORT TO TEMP1 ON ENTRY,NUMBER
DO "COMPRESSION number.PRG"

SORT TO TEMP1 ON NUMBER
USE TEMP1
list off fields number,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR
*list off fields number,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,RFEND,INIT,I
CLOSE DATABASES
ERASE TEMP1.DBF
ENDIF

CASE ANAL=2
* sort/DESCRIPTOR
SET HEADING ON
*SORT TO TEMP1 ON DESCRIPTOR,ENTRY,NUMBER/S for D='E' .OR. D='H' .OR. D='O' .OR. D='X' .OR. D='I'
*SORT TO TEMP1 ON ENTRY,DESCRIPTOR,NUMBER/S for D='E' .OR. D='H' .OR. D='O' .OR. D='X' .OR. D='I'
SORT TO TEMP1 ON ENTRY,START/S for D='E' .OR. D='H' .OR. D='O' .OR. D='X' .OR. D='I'
IF CONDEN=1
DO "COMPRESSION entry.PRG"
ELSE
USE TEMP1
list off fields number,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,RFEND,INIT,I
CLOSE DATABASES
ERASE TEMP1.DBF
ENDIF

CASE ANAL=3
* sort by abundance
SET HEADING ON
SORT TO TEMP1 ON ENTRY,NUMBER for D='E' .OR. D='H' .OR. D='O' .OR. D='X' .OR. D='I'
DO "compression abundance.prg"

CASE ANAL=4
* sort/interest
SET HEADING ON
IF CONDEN=1
SORT TO TEMP1 ON ENTRY,NUMBER FOR I>0
DO "COMPRESSION interest.PRG"

SORT ON I/D,ENTRY TO TEMP1 FOR I>1
USE TEMP1
list off fields number,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,RFEND,INIT,I
CLOSE DATABASES
ERASE TEMP1.DBF
ENDIF

CASE ANAL=5
* arrange/location
SET HEADING ON
STORE 4 TO AMPLIFIER

? 'Nuclear:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression location.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Cytoplasmic:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression location.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Cytoskeleton:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression location.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Cell surface:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression location.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Intracellular membrane:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression location.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Mitochondrial:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression location.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Secreted:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression location.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Other:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression location.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Unknown:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression location.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

IF CONDEN=1
SET DEVICE TO PRINTER
SET PRINTER ON
EJECT
DO "Output heading.prg"
USE "Analysis location.dbf"
DO "Create bargraph.prg"
SET HEADING OFF
? ' FUNCTIONAL CLASS TOTAL UNIQUE NEW % TOTAL'
?
LIST OFF FIELDS Z,NAME,CLONES,GENES,NEW,PERCENT,GRAPH
CLOSE DATABASES
ERASE TEMP2.DBF
SET HEADING ON
*USE "SmartGuy:FoxBASE+/Mac:fox files:TEMPMASTER.dbf"
ENDIF

CASE ANAL=6
* arrange/distribution
SET HEADING ON
STORE 3 TO AMPLIFIER

? 'Cell/tissue specific distribution:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression distrib.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Non-specific distribution:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression distrib.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Unknown distribution:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression distrib.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

IF CONDEN=1
SET DEVICE TO PRINTER
SET PRINTER ON
EJECT
DO "Output heading.prg"
USE "Analysis distribution.dbf"
DO "Create bargraph.prg"
SET HEADING OFF
? ' FUNCTIONAL CLASS TOTAL UNIQUE % TOTAL'
?
LIST OFF FIELDS P,NAME,CLONES,GENES,PERCENT,GRAPH
CLOSE DATABASES
ERASE TEMP2.DBF
SET HEADING ON
*USE "SmartGuy:FoxBASE+/Mac:fox files:TEMPMASTER.dbf"
ENDIF

CASE ANAL=7
* arrange/function
SET HEADING ON
STORE 10 TO AMPLIFIER

? ' BINDING PROTEINS'
?

? 'Surface molecules and receptors:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Calcium-binding proteins:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Ligands and effectors:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Other binding proteins:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

*EJECT

? ' ONCOGENES'
?

? 'General oncogenes:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'GTP-binding proteins:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Viral elements:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Kinases and Phosphatases:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Tumor-related antigens:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

*EJECT

? ' PROTEIN SYNTHETIC MACHINERY PROTEINS'
?

? 'Transcription and Nucleic Acid-binding proteins:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Translation:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Ribosomal proteins:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Protein processing:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

* EJECT

? ' ENZYMES'

? 'Ferroproteins:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Proteases and inhibitors:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Oxidative phosphorylation:'
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Sugar metabolism:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Amino acid metabolism:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Nucleic acid metabolism:'
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Lipid metabolism:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Other enzymes:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

*EJECT

? ' MISCELLANEOUS CATEGORIES'

? 'Stress response:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Structural:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Other clones:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

? 'Clones of unknown function:'
SORT ON ENTRY,NUMBER FIELDS RFEND,NUMBER,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I,COMMEN
IF CONDEN=1
DO "Compression function.prg"
ELSE
DO "Normal subroutine 1"
ENDIF

IF CONDEN=1
EJECT
*SET DEVICE TO PRINTER
*SET PRINT ON
DO "Output heading.prg"
USE "Analysis function.dbf"
DO "Create bargraph.prg"
SET HEADING OFF

SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 296,492 PIXELS FONT "Geneva",12 COLOR 0,0,0
? ' TOTAL TOTAL NEW DIST'
? ' FUNCTIONAL CLASS CLONES GENES GENES FUNCTIONAL CLASS'
***
*LIST OFF FIELDS P,NAME,CLONES,GENES,NEW,PERCENT,GRAPH,COMPANY
LIST OFF FIELDS P,NAME,CLONES,GENES,NEW,PERCENT,GRAPH
CLOSE DATABASES
ERASE TEMP2.DBF
SET HEADING ON
*USE "SmartGuy:FoxBASE+/Mac:fox files:TEMPMASTER.dbf"
ENDIF

CASE ANAL=8
DO "Subgroup summary 3.prg"
ENDCASE

DO "Test print.prg"
SET PRINT OFF
SET DEVICE TO SCREEN
CLOSE DATABASES
* ERASE TEMPLIB.DBF
* ERASE TEMPNUM.DBF
* ERASE TEMPDESIG.DBF
* ERASE SELECTED.DBF
CLEAR
LOOP
ENDDO
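
Editorial illustration (not part of the original FoxBASE+ listing): the designation filter in the Master analysis program above, the run of APPEND FROM TEMPLIB ... FOR D= statements controlled by the menu check boxes, can be pictured in Python as follows. This is only a sketch under stated assumptions. The function name `select_by_designation`, the dictionary record layout, and the sample records are hypothetical; the flag names and the designation codes E, H, O, I, X and N are taken from the listing.

```python
def select_by_designation(records, ematch=0, hmatch=0, omatch=0, imatch=0, xmatch=0):
    """Mimic the sequence of APPEND FROM TEMPLIB ... FOR D=... statements.

    `records` is a list of dicts with a one-letter designation under "d"
    (hypothetical layout standing in for the TEMPLIB table).
    """
    selected = []
    # No E/H/O/I box checked: the listing appends every record.
    if ematch == 0 and hmatch == 0 and omatch == 0 and imatch == 0:
        selected.extend(records)
    if ematch == 1:
        selected.extend(r for r in records if r["d"] == "E")
    if hmatch == 1:
        selected.extend(r for r in records if r["d"] == "H")
    if omatch == 1:
        selected.extend(r for r in records if r["d"] == "O")
    if imatch == 1:
        # The Incyte box pulls in designations I, X and N together.
        selected.extend(r for r in records if r["d"] in ("I", "X", "N"))
    if xmatch == 1:
        selected.extend(r for r in records if r["d"] == "X")
    return selected


# Hypothetical sample: one clone per designation code.
sample = [{"number": n, "d": d} for n, d in enumerate("EHOIXN", start=1)]
print([r["number"] for r in select_by_designation(sample, ematch=1, hmatch=1)])
```

Like the listing, the sketch appends in sequence, so a record can be selected twice when overlapping boxes (for example Incyte and EST) are both checked.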
* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAM
USE TEMP1
COUNT TO TOT
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0

DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK
COUNT TO UNIQUE
COUNT TO NEWGENES FOR D='H' .OR. D='O'
SW2=1
LOOP
ENDIF
GO MARK1
DUP = 1
STORE ENTRY TO TESTA
SW = 0

DO WHILE SW=0 TEST
SKIP
STORE ENTRY TO TESTB
IF TESTA = TESTB
DELETE
DUP = DUP+1
LOOP
ENDIF
GO MARK1
REPLACE RFEND WITH DUP
MARK1 = MARK1+DUP
SW=1
LOOP
ENDDO TEST
LOOP
ENDDO ROLL

GO TOP
STORE Z TO LOC
USE "Analysis location.dbf"
LOCATE FOR Z=LOC
REPLACE CLONES WITH TOT
REPLACE GENES WITH UNIQUE
REPLACE NEW WITH NEWGENES
USE TEMP1
SORT ON RFEND/D TO TEMP2
USE TEMP2

?? STR(UNIQUE,5,0)
?? ' genes, for a total of '
?? STR(TOT,5,0)
?? ' clones'
? ' V Coincidence'
list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I

*SET PRINT OFF
CLOSE DATABASES
ERASE TEMP1.DBF
ERASE TEMP2.DBF
USE TEMPDESIG
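
Editorial illustration (not part of the original FoxBASE+ listing): the compression subroutine above walks a table already sorted by ENTRY, deletes repeated entries, stores the run length in RFEND, and then sorts by descending RFEND so that the most abundant genes are listed first. A minimal Python sketch of the same idea follows. The function name `compress`, the dictionary record layout, and the sample entry names are assumptions made for illustration, and the NEWGENES count and the update of the analysis table are left out.

```python
from itertools import groupby


def compress(records):
    """Collapse runs of records that share the same entry.

    `records` must already be sorted by entry, as TEMP1 is in the listing.
    Returns (unique_rows, total, unique_count); each unique row carries the
    run length under "rfend", and rows are ordered by descending "rfend".
    """
    total = len(records)
    unique_rows = []
    for _, run in groupby(records, key=lambda rec: rec["entry"]):
        run = list(run)
        first = dict(run[0])       # keep the first record of each run
        first["rfend"] = len(run)  # number of clones matching this entry
        unique_rows.append(first)
    # Counterpart of SORT ON RFEND/D: most abundant entries first.
    unique_rows.sort(key=lambda rec: rec["rfend"], reverse=True)
    return unique_rows, total, len(unique_rows)


# Hypothetical sample data, sorted by entry.
clones = [
    {"number": 1, "entry": "ACTB"},
    {"number": 4, "entry": "ACTB"},
    {"number": 2, "entry": "EEF1A"},
    {"number": 3, "entry": "FTL"},
    {"number": 5, "entry": "FTL"},
    {"number": 6, "entry": "FTL"},
]
rows, total, unique = compress(clones)
print(f"{unique} genes, for a total of {total} clones")
for row in rows:
    print(row["number"], row["rfend"], row["entry"])
```

Grouping consecutive records only works because the input is sorted by entry first, which is why each calling program issues a SORT before invoking the compression subroutine.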
* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS

USE TEMP1
COUNT TO TOT
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0

DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK
COUNT TO UNIQUE
SW2=1
LOOP
ENDIF
GO MARK1
DUP = 1
STORE ENTRY TO TESTA
SW = 0

DO WHILE SW=0 TEST
SKIP
STORE ENTRY TO TESTB
IF TESTA = TESTB
DELETE
DUP = DUP+1
LOOP
ENDIF
GO MARK1
REPLACE RFEND WITH DUP
MARK1 = MARK1+DUP
SW=1
LOOP
ENDDO TEST
LOOP
ENDDO ROLL

*BROWSE
*SET PRINTER ON
SORT ON DATE TO TEMP2
USE TEMP2

?? STR(UNIQUE,4,0)
?? ' genes, for a total of '
?? STR(TOT,4,0)
?? ' clones'
?
? ' V Coincidence'

COUNT TO P4 FOR I=4
IF P4>0
? STR(P4,3,0)
?? ' genes with priority = 4 (Secondary analysis:)'
list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT for I=4
?
ENDIF

COUNT TO P3 FOR I=3
IF P3>0
? STR(P3,3,0)
?? ' genes with priority = 3 (Full insert sequence:)'
list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT for I=3
ENDIF

COUNT TO P2 FOR I=2
IF P2>0
? STR(P2,3,0)
?? ' genes with priority = 2 (Primary analysis complete:)'
list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT for I=2
?
ENDIF

COUNT TO P1 FOR I=1
IF P1>0
? STR(P1,3,0)
?? ' genes with priority = 1 (Primary analysis needed:)'
list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT for I=1
ENDIF

*SET PRINT OFF
CLOSE DATABASES
ERASE TEMP1.DBF
ERASE TEMP2.DBF
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"

* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS

USE TEMP1
COUNT TO TOT
REPLACE ALL RFEND WITH 1
MARK1 = 1

DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK
COUNT TO UNIQUE
SW2=1
LOOP
ENDIF
GO MARK1
DUP = 1
STORE ENTRY TO TESTA
SW = 0

DO WHILE SW=0 TEST
SKIP
STORE ENTRY TO TESTB
IF TESTA = TESTB
DELETE
DUP = DUP+1
LOOP
ENDIF
GO MARK1
REPLACE RFEND WITH DUP
MARK1 = MARK1+DUP
SW=1
LOOP
ENDDO TEST
LOOP
ENDDO ROLL

* BROWSE
*SET PRINTER ON
SORT ON NUMBER TO TEMP2
USE TEMP2

?? STR(UNIQUE,4,0)
?? ' genes, for a total of '
?? STR(TOT,5,0)
?? ' clones'
? ' V Coincidence'
list off fields number,RFEND,L,D,F,Z,R,C,ENTRY

*SET PRINT OFF
CLOSE DATABASES
ERASE TEMP1.DBF
ERASE TEMP2.DBF
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"

* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE TEMP1
COUNT TO TOT
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK
COUNT TO UNIQUE
COUNT TO NEWGENES FOR D='H'.OR.D='O'
SW2=1
LOOP
ENDIF
GO MARK1
DUP = 1
STORE ENTRY TO TESTA
SW = 0
DO WHILE SW=0 TEST
SKIP
STORE ENTRY TO TESTB
IF TESTA = TESTB
DELETE
DUP = DUP+1
LOOP
ENDIF
GO MARK1
REPLACE RFEND WITH DUP
MARK1 = MARK1+DUP
SW=1
LOOP
ENDDO TEST
LOOP
ENDDO ROLL
GO TOP
STORE R TO FUNC
USE "Analysis function.dbf"
LOCATE FOR P=FUNC
REPLACE CLONES WITH TOT
REPLACE GENES WITH UNIQUE
REPLACE NEW WITH NEWGENES
USE TEMP1
SORT ON RFEND/D TO TEMP2
USE TEMP2
SET HEADING ON
?? STR(UNIQUE,5,0)
?? ' genes, for a total of '
?? STR(TOT,5,0)
?? ' clones'
***
? ' V Coincidence'
list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I
* SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",13 COLOR 0,0,
*list off fields RFEND,S,DESCRIPTOR
*SET PRINT OFF
CLOSE DATABASES
ERASE TEMP1.DBF
ERASE TEMP2.DBF
USE TEMPDESIG

* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE TEMP1
COUNT TO TOT
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK
COUNT TO UNIQUE
SW2=1
LOOP
ENDIF
GO MARK1
DUP = 1
STORE ENTRY TO TESTA
SW = 0
DO WHILE SW=0 TEST
SKIP
STORE ENTRY TO TESTB
IF TESTA = TESTB
DELETE
DUP = DUP+1
LOOP
ENDIF
GO MARK1
REPLACE RFEND WITH DUP
MARK1 = MARK1+DUP
SW=1
LOOP
ENDDO TEST
LOOP
ENDDO ROLL
GO TOP
STORE P TO DIST
USE "Analysis distribution.dbf"
LOCATE FOR P=DIST
REPLACE CLONES WITH TOT
REPLACE GENES WITH UNIQUE
USE TEMP1
sort on rfend/d to TEMP2
USE TEMP2
?? STR(UNIQUE,5,0)
?? ' genes, for a total of '
?? STR(TOT,5,0)
?? ' clones'
? ' V Coincidence'
list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I
*SET PRINT OFF
CLOSE DATABASES
ERASE TEMP1.DBF
ERASE TEMP2.DBF
USE TEMPDESIG
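
The compression subroutine repeated in the listings above collapses each run of records that share an ENTRY value into one record and stores the run length in RFEND. A minimal Python sketch of the same idea follows. It assumes, as the FoxBASE loop appears to, that records with the same ENTRY are adjacent (for example, after sorting on ENTRY). The function name `compress_entries`, the dictionary records, and the sample entry names are illustrative assumptions, not part of the original programs.

```python
from itertools import groupby


def compress_entries(records):
    """Collapse runs of adjacent records that share an ENTRY value.

    Returns (compressed, unique, total): one record per run with RFEND
    set to the run length, the number of runs, and the input size.
    """
    compressed = []
    for _, run in groupby(records, key=lambda r: r["ENTRY"]):
        run = list(run)
        first = dict(run[0])       # keep the first record of the run
        first["RFEND"] = len(run)  # RFEND holds the duplicate count
        compressed.append(first)
    return compressed, len(compressed), len(records)


# Hypothetical sample data: three clones, two of them the same entry.
clones = [
    {"NUMBER": 1, "ENTRY": "HUMEF1B"},
    {"NUMBER": 2, "ENTRY": "HUMEF1B"},
    {"NUMBER": 3, "ENTRY": "SAMPLE2"},
]
compressed, unique, total = compress_entries(clones)
print(f"{unique} genes, for a total of {total} clones")
```

Here `unique` and `total` play the roles of UNIQUE and TOT in the listings; the original works in place on a database file with DELETE and PACK rather than building a new list.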



73 



WO 95/20681 



PCT/US95/01160 



* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE TEMP1
COUNT TO TOT
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK
COUNT TO UNIQUE
SW2=1
LOOP
ENDIF
GO MARK1
DUP = 1
STORE ENTRY TO TESTA
SW = 0
DO WHILE SW=0 TEST
SKIP
STORE ENTRY TO TESTB
IF TESTA = TESTB
DELETE
DUP = DUP+1
LOOP
ENDIF
GO MARK1
REPLACE RFEND WITH DUP
MARK1 = MARK1+DUP
SW=1
LOOP
ENDDO TEST
LOOP
ENDDO ROLL
GO TOP
USE TEMP1
?? STR(UNIQUE,5,0)
?? ' genes, for a total of '
?? STR(TOT,5,0)
?? ' clones'
? ' V Coincidence'
list off fields number,RFEND,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,LENGTH,INIT,I
*SET PRINT OFF
CLOSE DATABASES
ERASE TEMP1.DBF
USE TEMPDESIG

* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE "SmartGuy:FoxBASE+/Mac:fox files:Clones.dbf"
COPY TO TEMP1 FOR
USE TEMP1
COUNT TO IDGENE FOR D='E'.OR.D='O'.OR.D='H'.OR.D='N'.OR.D='R'.OR.D='A'
DELETE FOR D='N'.OR.D='D'.OR.D='A'.OR.D='U'.OR.D='S'.OR.D='M'.OR.D='R'.OR.D='V'
COUNT TO TOT
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK
COUNT TO UNIQUE
SW2=1
LOOP
ENDIF
GO MARK1
DUP = 1
STORE ENTRY TO TESTA
SW = 0
DO WHILE SW=0 TEST
SKIP
STORE ENTRY TO TESTB
IF TESTA = TESTB
DELETE
DUP = DUP+1
LOOP
ENDIF
GO MARK1
REPLACE RFEND WITH DUP
MARK1 = MARK1+DUP
SW=1
LOOP
ENDDO TEST
LOOP
ENDDO ROLL
* BROWSE
*SET PRINTER ON
SORT ON RFEND/D, NUMBER TO TEMP2
USE TEMP2
REPLACE ALL START WITH RFEND/IDGENE*10000
?? STR(UNIQUE,5,0)
?? ' genes, for a total of '
?? STR(TOT,5,0)
?? ' clones'
? ' Coincidence V V Clones/10000'
set heading off
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list fields number,RFEND,START,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,INIT,I
*SET PRINT OFF
CLOSE DATABASES
ERASE TEMP1.DBF
ERASE TEMP2.DBF
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"

* COMPRESSION SUBROUTINE FOR ANALYSIS PROGRAMS
USE TEMP1
COUNT TO IDGENE FOR D='E'.OR.D='O'.OR.D='H'.OR.D='N'.OR.D='R'.OR.D='A'
DELETE FOR D='N'.OR.D='D'.OR.D='A'.OR.D='U'.OR.D='S'.OR.D='M'.OR.D='R'.OR.D='V'
PACK
COUNT TO TOT
REPLACE ALL RFEND WITH 1
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK
COUNT TO UNIQUE
SW2=1
LOOP
ENDIF
GO MARK1
DUP = 1
STORE ENTRY TO TESTA
SW = 0
DO WHILE SW=0 TEST
SKIP
STORE ENTRY TO TESTB
IF TESTA = TESTB
DELETE
DUP = DUP+1
LOOP
ENDIF
GO MARK1
REPLACE RFEND WITH DUP
MARK1 = MARK1+DUP
SW=1
LOOP
ENDDO TEST
LOOP
ENDDO ROLL
*BROWSE
*SET PRINTER ON
SORT ON RFEND/D, NUMBER TO TEMP2
USE TEMP2
REPLACE ALL START WITH RFEND/IDGENE*10000
?? STR(UNIQUE,5,0)
?? ' genes, for a total of '
?? STR(TOT,5,0)
?? ' clones'
? ' Coincidence V V Clones/10000'
set heading off
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
list fields number,RFEND,START,L,D,F,Z,R,C,ENTRY,S,DESCRIPTOR,INIT,I
*SET PRINT OFF
CLOSE DATABASES
ERASE TEMP1.DBF
ERASE TEMP2.DBF
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
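
The abundance listings above sort the compressed records by descending RFEND (ties broken by NUMBER) and fill START with RFEND/IDGENE*10000, that is, the number of clones of each gene per 10,000 identified clones. The following Python sketch illustrates that arithmetic under stated assumptions: the function name `abundance_per_10000`, the dictionary records, and the sample gene names and counts are hypothetical and do not come from the original programs.

```python
def abundance_per_10000(compressed, idgene):
    """Rank compressed records and add START = RFEND / idgene * 10000.

    `compressed` is a list of dicts with NUMBER and RFEND keys;
    `idgene` is the count of identified clones (IDGENE in the listing).
    """
    if idgene <= 0:
        raise ValueError("idgene must be a positive clone count")
    # Mirrors SORT ON RFEND/D, NUMBER: RFEND descending, then NUMBER.
    ranked = sorted(compressed, key=lambda r: (-r["RFEND"], r["NUMBER"]))
    return [dict(r, START=r["RFEND"] / idgene * 10000) for r in ranked]


# Hypothetical sample data.
genes = [
    {"NUMBER": 7, "ENTRY": "GENE_B", "RFEND": 2},
    {"NUMBER": 3, "ENTRY": "GENE_A", "RFEND": 6},
]
table = abundance_per_10000(genes, idgene=4000)
for row in table:
    print(row["ENTRY"], row["RFEND"], round(row["START"], 1))
```

Normalizing to a fixed denominator lets abundance figures from libraries of different sizes be compared directly.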
USE TEMP1
COUNT TO TOT
?? ' Total of '
?? STR(TOT,4,0)
?? ' clones'
?
*list off fields number,L,D,F,Z,R,C,entry,descriptor,length,rfend,init,i
list off fields number,L,D,F,Z,R,C,entry,descriptor
CLOSE DATABASES
ERASE TEMP1.DBF
USE TEMPDESIG

*Lifeseq menu; version 8-7-94

SET TALK OFF
set device to screen
CLEAR
USE "SmartGuy:FoxBASE+/Mac:fox files:clones.dbf"
STORE LUPDATE() TO Update
GO BOTTOM
STORE RECNO() TO cloneno
STORE 6 TO Chooser
DO WHILE .T.
* Program.: Lifeseq menu.fmt
* Date....: 1/11/95
* Version.: FoxBASE+/Mac, revision 1.10
* Notes....: Format file Lifeseq menu
*
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",258 COLOR 0,0,
@ PIXELS 18,126 TO 77,365 STYLE 28479 COLOR 32767,-25600,-1,-16223,-16721,-15725
@ PIXELS 110,29 TO 188,217 STYLE 3871 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 45,161 SAY "LIFESEQ" STYLE 65536 FONT "Geneva",536 COLOR 0,0,-1,-1,7135,5884
@ PIXELS 36,269 SAY "TM" STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,7135,5884
@ PIXELS 63,143 SAY "Molecular Biology Desktop" STYLE 65536 FONT "Helvetica",18 COLOR 0,0,0,
@ PIXELS 90,252 TO 251,467 STYLE 28447 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 117,270 GET Chooser STYLE 65536 FONT "Chicago",12 PICTURE "@*RV Transcript profiles
@ PIXELS 135,128 SAY Update STYLE 0 FONT "Geneva",12 SIZE 15,79 COLOR 0,0,0,-25600,-1,-1
@ PIXELS 171,128 SAY cloneno STYLE 0 FONT "Geneva",12 SIZE 15,79 COLOR 0,0,0,-25600,-1,-1
@ PIXELS 135,44 SAY "Last update:" STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 171,44 SAY "Total Clones:" STYLE 65536 FONT "Geneva",12 COLOR 0,0,-1,-1,-1,-1
@ PIXELS 45,296 SAY "v1.30" STYLE 65536 FONT "Geneva",782 COLOR 0,0,-1,-1,-1,-1
* EOF: Lifeseq menu.fmt
READ
DO CASE
CASE Chooser=1
DO "SmartGuy:FoxBASE+/Mac:fox files:Output programs:Master analysis 3.prg"
CASE Chooser=2
DO "SmartGuy:FoxBASE+/Mac:fox files:Output programs:Subtraction 2.prg"
CASE Chooser=3
DO "SmartGuy:FoxBASE+/Mac:fox files:Output programs:Northern (single).prg"
CASE Chooser=4
USE "Libraries.dbf"
BROWSE
CASE Chooser=5
DO "SmartGuy:FoxBASE+/Mac:fox files:Output programs:See individual clone.prg"
case Chooser=6
DO "SmartGuy:FoxBASE+/Mac:fox files:Libraries:Output programs:Menu.prg"
CASE Chooser=7
CLEAR
SCREEN 1 OFF
RETURN
ENDCASE
LOOP
ENDDO

@1,30 SAY "Database Subset Analysis" STYLE FONT "Geneva",274 COLOR 0,0,0,-1,-1,-1
?
?
?
?
? date()
?? ' '
?? TIME()
? 'Clone numbers '
?? STR(INITIATE,6,0)
?? ' through '
?? STR(TERMINATE,6,0)
? 'Libraries: '
IF ENTIRE=1
? 'All libraries'
ENDIF
IF ENTIRE=2
MARK=1
DO WHILE .T.
IF MARK>STOPIT
EXIT
ENDIF
USE SELECTED
GO MARK
? ' '
?? TRIM(libname)
STORE MARK+1 TO MARK
LOOP
ENDDO
ENDIF
? 'Designations: '
IF Ematch=0 .AND. Hmatch=0 .AND. Omatch=0
?? 'All'
ENDIF
IF Ematch=1
?? 'Exact,'
ENDIF
IF Hmatch=1
?? 'Human,'
ENDIF
IF Omatch=1
?? 'Other sp.'
endif
IF CONDEN=1
? 'Condensed format analysis'
ENDIF
IF ANAL=1
? 'Sorted by NUMBER'
ENDIF
IF ANAL=2
? 'Sorted by ENTRY'
ENDIF
IF ANAL=3
? 'Arranged by ABUNDANCE'
ENDIF
IF ANAL=4
? 'Sorted by INTEREST'
ENDIF
IF ANAL=5
? 'Arranged by LOCATION'
ENDIF
IF ANAL=6
? 'Arranged by DISTRIBUTION'
ENDIF
IF ANAL=7
? 'Arranged by FUNCTION'
ENDIF

? 'Total clones represented: '
?? STR(STARTOT,6,0)
? 'Total clones analyzed: '
?? STR(ANALTOT,6,0)
?
?

USE TEMP1
COUNT TO TOT
?? ' Total of '
?? STR(TOT,4,0)
?? ' clones'
?
*list off fields number,L,D,F,Z,R,C,ENTRY,DESCRIPTOR,LENGTH,RFEND,INIT,I
list off fields number,L,D,F,Z,R,C,ENTRY,DESCRIPTOR
CLOSE DATABASES
ERASE TEMP1.DBF
USE TEMPDESIG

USE TEMP1

COUNT TO TOT
?? ' Total of '
?? STR(TOT,4,0)
?? ' clones'
?
*list off fields number,L,D,F,Z,R,C,ENTRY,DESCRIPTOR,LENGTH,RFEND,INIT,I
list off fields number,L,D,F,Z,R,C,ENTRY,DESCRIPTOR
CLOSE DATABASES
ERASE TEMP1.DBF
USE TEMPDESIG

*Northern (single), version 11-25-94

close databases
SET TALK OFF
SET PRINT OFF
SET EXACT OFF
CLEAR
STORE ' ' TO Eobject
STORE ' ' TO Dobject
STORE 0 TO Numb
STORE 0 TO Zog
STORE 1 TO Bail
DO WHILE .T.
* Program.: Northern (single).fmt
* Date....: 8/ 8/94
* Version.: FoxBASE+/Mac, revision 1.10
* Notes....: Format file Northern (single)
*
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",12 COLOR 0,0,0
@ PIXELS 15,81 TO 46,39? STYLE 28447 COLOR 0,0,-1,-25600,-1,-1
@ PIXELS 89,79 TO 192,422 STYLE 28447 COLOR 0,0,0,-25600,-1,-1
@ PIXELS 115,98 SAY "Entry #:" STYLE 65536 FONT "Geneva",12 COLOR 0,0,0,-1,-1,-1
@ PIXELS 115,173 GET Eobject STYLE 0 FONT "Geneva",12 SIZE 15,142 COLOR 0,0,0,-1,-1,-1
@ PIXELS 145,89 SAY "Description" STYLE 65536 FONT "Geneva",12 COLOR 0,0,0,-1,-1,-1
@ PIXELS 145,173 GET Dobject STYLE 0 FONT "Geneva",12 SIZE 15,241 COLOR 0,0,0,-1,-1,-1
@ PIXELS 35,89 SAY "Single Northern search screen" STYLE 65536 FONT "Geneva",274 COLOR 0,0,-
@ PIXELS 220,162 GET Bail STYLE 65536 FONT "Chicago",12 PICTURE "@*R Continue;Bail out" SIZE
@ PIXELS 175,98 SAY "Clone #:" STYLE 65536 FONT "Geneva",12 COLOR 0,0,0,-1,-1,-1
@ PIXELS 175,173 GET Numb STYLE 0 FONT "Geneva",12 SIZE 15,70 COLOR 0,0,0,-1,-1,-1
@ PIXELS 80,152 SAY "Enter any ONE of the following:" STYLE 65536 FONT "Geneva",12 COLOR -1,
* EOF: Northern (single).fmt
READ
IF Bail=2
CLEAR
screen 1 off
RETURN
ENDIF
USE "SmartGuy:FoxBASE+/Mac:Fox files:Lookup.dbf"
SET TALK ON
IF Eobject<>' '
STORE UPPER(Eobject) to Eobject
SET SAFETY OFF
SORT ON Entry TO "Lookup entry.dbf"
SET SAFETY ON
USE "Lookup entry.dbf"
LOCATE FOR Look=Eobject
IF .NOT.FOUND()
CLEAR
LOOP
ENDIF
BROWSE
STORE Entry TO Searchval
CLOSE DATABASES
ERASE "Lookup entry.dbf"
ENDIF
IF Dobject<>' '
SET EXACT OFF
SET SAFETY OFF
SORT ON descriptor TO "Lookup descriptor.dbf"
SET SAFETY ON
USE "Lookup descriptor.dbf"
LOCATE FOR UPPER(TRIM(descriptor))=UPPER(TRIM(Dobject))
IF .NOT.FOUND()
CLEAR
LOOP

ENDIF
BROWSE
STORE Entry TO Searchval
CLOSE DATABASES
ERASE "Lookup descriptor.dbf"
SET EXACT ON
ENDIF
IF Numb<>0
USE "SmartGuy:FoxBASE+/Mac:Fox files:clones.dbf"
GO Numb
BROWSE
STORE Entry TO Searchval
ENDIF
CLEAR
? 'Northern analysis for entry '
?? Searchval
?
? 'Enter Y to proceed'
WAIT TO OK
CLEAR
IF UPPER(OK)<>'Y'
screen 1 off
RETURN
ENDIF
* COMPRESSION SUBROUTINE FOR Library.dbf
? 'Compressing the Libraries file now...'
USE "SmartGuy:FoxBASE+/Mac:Fox files:libraries.dbf"
SET SAFETY OFF
SORT ON library TO "Compressed libraries.dbf"
* FOR entered>0
SET SAFETY ON
USE "Compressed libraries.dbf"
DELETE FOR entered=0
PACK
COUNT TO TOT
MARK1 = 1
SW2=0
DO WHILE SW2=0 ROLL
IF MARK1 >= TOT
PACK
SW2=1
LOOP
ENDIF
GO MARK1
STORE library TO TESTA
SKIP
STORE library TO TESTB
IF TESTA = TESTB
DELETE
ENDIF
MARK1 = MARK1+1
LOOP
ENDDO ROLL
* Northern analysis
CLEAR
? 'Doing the northern now...'
SET TALK ON
USE "SmartGuy:FoxBASE+/Mac:Fox files:clones.dbf"
SET SAFETY OFF
COPY TO "Hits.dbf" FOR entry=searchval
SET SAFETY ON
CLOSE DATABASES
SELECT 1
USE "Compressed libraries.dbf"
STORE RECCOUNT() TO Entries
SELECT 2
USE "Hits.dbf"
Mark=1
DO WHILE .T.
SELECT 1
IF Mark>Entries
EXIT
ENDIF
GO MARK
STORE library TO Jigger
SELECT 2
COUNT TO Zog FOR library=Jigger
SELECT 1
REPLACE hits with Zog
Mark=Mark+1
LOOP
ENDDO
SELECT 1
BROWSE FIELDS LIBRARY,LIBNAME,ENTERED,HITS AT 0,0
CLEAR
? 'Enter Y to print:'
WAIT TO FRINSET
IF UPPER(FRINSET)='Y'
SET PRINT ON
CLEAR
EJECT
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",14 COLOR 0,0,0
? 'DATABASE ENTRIES MATCHING ENTRY '
?? Searchval
? DATE()
?
SCREEN 1 TYPE 0 HEADING "Screen 1" AT 40,2 SIZE 286,492 PIXELS FONT "Geneva",7 COLOR 0,0,0,
LIST OFF FIELDS library,libname,entered,hits
?
?
SELECT 2
LIST OFF FIELDS NUMBER,LIBRARY,D,S,F,Z,R,ENTRY,DESCRIPTOR,RFSTART,START,RFEND
SET TALK OFF
SET PRINT OFF
ENDIF
CLOSE DATABASES
SET TALK OFF
CLEAR
DO "Test print.prg"
RETURN
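
The Northern program above copies every clone whose entry matches the search value into a hits file and then, for each library, counts how many of those hits came from that library. A minimal Python sketch of that counting step is given below. It assumes plain equality on the entry name (the FoxBASE comparison can behave as a prefix match when SET EXACT is OFF), and the function name `northern_hits` and the in-memory records are illustrative; the sample rows are taken from Table 6.

```python
from collections import Counter


def northern_hits(clones, libraries, searchval):
    """Count, per library, the clones whose ENTRY equals searchval.

    Returns a copy of each library record with a HITS field added,
    including libraries with zero hits.
    """
    hits = Counter(c["LIBRARY"] for c in clones if c["ENTRY"] == searchval)
    return [dict(lib, HITS=hits.get(lib["LIBRARY"], 0)) for lib in libraries]


libraries = [
    {"LIBRARY": "HMC1NOT01", "LIBNAME": "Mast cell line HMC-1"},
    {"LIBRARY": "STOMNOT01", "LIBNAME": "Stomach"},
    {"LIBRARY": "U937NOT01", "LIBNAME": "U937, monocytic leuk"},
]
clones = [
    {"NUMBER": 2304, "LIBRARY": "U937NOT01", "ENTRY": "HUMEF1B"},
    {"NUMBER": 3240, "LIBRARY": "HMC1NOT01", "ENTRY": "HUMEF1B"},
    {"NUMBER": 3269, "LIBRARY": "HMC1NOT01", "ENTRY": "HUMEF1B"},
]
report = northern_hits(clones, libraries, "HUMEF1B")
for row in report:
    print(row["LIBRARY"], row["LIBNAME"], row["HITS"])
```

A single pass with a counter replaces the per-library COUNT loop of the original, which rescans the hits file once for every library record.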
TABLE 6

library    libname
ADENINB01  Inflamed adenoid
ADRENOR01  Adrenal gland (r)
ADRENOT01  Adrenal gland (T)
AMLBNOT01  AML blast cells (T)
BMARNOT01  Bone marrow
BMARNOT02  Bone marrow (T)
CARDNOT01  Cardiac muscle (T)
CHAONOT01  Chin. hamster ovary
CORNNOT01  Corneal stroma
FIBRAGT01  Fibroblast, AT 5
FIBRAGT02  Fibroblast, AT 30
FIBRANT01  Fibroblast, AT
FIBRNGT01  Fibroblast, uv 5
FIBRNGT02  Fibroblast, uv 30
FIBRNOT01  Fibroblast
FIBRNOT02  Fibroblast, normal
HMC1NOT01  Mast cell line HMC-1
HUVELPB01  HUVEC IFN, TNF, LPS
HUVENOB01  HUVEC control
HUVESTB01  HUVEC shear stress
HYPONOB01  Hypothalamus
KIDNNOT01  Kidney (T)
LIVRNOT01  Liver (T)
LUNGNOT01  Lung (T)
MUSCNOT01  Skeletal muscle (T)
OVIDNOB01  Oviduct
PANCNOT01  Pancreas, normal
PITUNOR01  Pituitary (r)
PITUNOT01  Pituitary (T)
PLACNOB01  Placenta
SINTNOT02  Small intestine (T)
SPLNFET01  Spleen+liver, fetal
SPLNNOT02  Spleen (T)
STOMNOT01  Stomach
SYNORAB01  Rheum. synovium
TBLYNOT01  T + B lymphoblast
TESTNOT01  Testis (T)
THP1NOB01  THP-1 control
THP1PEB01  THP phorbol
THP1PLB01  THP-1 phorbol LPS
U937NOT01  U937, monocytic leuk

number  library    d s f z r  entry    descriptor                rfstart  start  rfend
2304    U937NOT01  E H C C T  HUMEF1B  Elongation factor 1-beta  n-       0      773
3240    HMC1NOT01  E H C C T  HUMEF1B  Elongation factor 1-beta  0        370    773
3269    HMC1NOT01  E H C C T  HUMEF1B  Elongation factor 1-beta  0        371    773
4€93    HMC1NOT01  E H C C T  HUMEF1B  Elongation factor 1-beta  0        470    773
8989    HMC1NOT01  E H C C T  HUMEF1B  Elongation factor 1-beta  0        327    773
9139    HMC1NOT01  E H C C T  HUMEF1B  Elongation factor 1-beta  0        375    773

WHAT IS CLAIMED IS:

1. A method of analyzing a specimen containing gene
transcripts, said method comprising the steps of:

(a) producing a library of biological sequences;

(b) generating a set of transcript sequences, where
each of the transcript sequences in said set is indicative
of a different one of the biological sequences of the
library;

(c) processing the transcript sequences in a
programmed computer in which a database of reference
transcript sequences indicative of reference biological
sequences is stored, to generate an identified sequence
value for each of the transcript sequences, where each said
identified sequence value is indicative of a sequence
annotation and a degree of match between one of the
transcript sequences and at least one of the reference
transcript sequences; and

(d) processing each said identified sequence value to
generate final data values indicative of a number of times
each identified sequence value is present in the library.

2. The method of claim 1, wherein step (a) includes
the steps of:

obtaining a mixture of mRNA;

making cDNA copies of the mRNA;

isolating a representative population of clones
transfected with the cDNA and producing therefrom the
library of biological sequences.

3. The method of claim 1, wherein the biological
sequences are cDNA sequences.

4. The method of claim 1, wherein the biological
sequences are RNA sequences.

5. The method of claim 1, wherein the biological
sequences are protein sequences.

6. The method of claim 1, wherein a first value of
said degree of match is indicative of an exact match, and a
second value of said degree of match is indicative of a
non-exact match.

7. A method of comparing two specimens containing
gene transcripts, said method comprising:

(a) analyzing a first specimen according to the
method of claim 1;

(b) producing a second library of biological
sequences;

(c) generating a second set of transcript sequences,
where each of the transcript sequences in said second set
is indicative of a different one of the biological
sequences of the second library;

(d) processing the second set of transcript sequences
in said programmed computer to generate a second set of
identified sequence values known as further identified
sequence values, where each of the further identified
sequence values is indicative of a sequence annotation and
a degree of match between one of the biological sequences
of the second library and at least one of the reference
sequences;

(e) processing each said further identified sequence
value to generate further final data values indicative of a
number of times each further identified sequence value is
present in the second library; and

(f) processing the final data values from the first
specimen and the further identified sequence values from
the second specimen to generate ratios of transcript
sequences, each of said ratio values indicative of
differences in numbers of gene transcripts between the two
specimens.

8. A method of quantifying relative abundance of mRNA
in a biological specimen, said method comprising the steps
of:

(a) isolating a population of mRNA transcripts from
the biological specimen;

(b) identifying genes from which the mRNA was
transcribed by a sequence-specific method;

(c) determining numbers of mRNA transcripts
corresponding to each of the genes; and

(d) using the mRNA transcript numbers to determine
the relative abundance of mRNA transcripts within the
population of mRNA transcripts.

9. A diagnostic method which comprises producing a
gene transcript image, said method comprising the steps of:

(a) isolating a population of mRNA transcripts from a
biological specimen;

(b) identifying genes from which the mRNA was
transcribed by a sequence-specific method;

(c) determining numbers of mRNA transcripts
corresponding to each of the genes; and

(d) using the mRNA transcript numbers to determine
the relative abundance of mRNA transcripts within the
population of mRNA transcripts, where data determining the
relative abundance values of mRNA transcripts is the gene
transcript image of the biological specimen.

10. The method of claim 9, further comprising:

(e) providing a set of standard normal and diseased
gene transcript images; and

(f) comparing the gene transcript image of the
biological specimen with the gene transcript images of step
(e) to identify at least one of the standard gene
transcript images which most closely approximate the gene
transcript image of the biological specimen.

11. The method of claim 9, wherein the biological
specimen is biopsy tissue, sputum, blood or urine.

12. A method of producing a gene transcript image,
said method comprising the steps of

(a) obtaining a mixture of mRNA;

(b) making cDNA copies of the mRNA;

(c) inserting the cDNA into a suitable vector and
using said vector to transfect suitable host strain cells
which are plated out and permitted to grow into clones,
each clone representing a unique mRNA;

(d) isolating a representative population of
recombinant clones;

(e) identifying amplified cDNAs from each clone in
the population by a sequence-specific method which
identifies gene from which the unique mRNA was transcribed;

(f) determining a number of times each gene is
represented within the population of clones as an
indication of relative abundance; and

(g) listing the genes and their relative abundance in
order of abundance, thereby producing the gene transcript
image.

13. The method of claim 12, also including the step
of diagnosing disease by:

repeating steps (a) through (g) on biological
specimens from random sample of normal and diseased humans,
encompassing a variety of diseases, to produce reference
sets of normal and diseased gene transcript images;

obtaining a test specimen from a human, and producing
a test gene transcript image by performing steps (a)
through (g) on said test specimen;

comparing the test gene transcript image with the
reference sets of gene transcript images; and

identifying at least one of the reference gene
transcript images which most closely approximates the test
gene transcript image.

14. A computer system for analyzing a library of
biological sequences, said system including:

means for receiving a set of transcript sequences,
where each of the transcript sequences is indicative of a
different one of the biological sequences of the library;
and

means for processing the transcript sequences in the
computer system in which a database of reference transcript
sequences indicative of reference biological sequences is
stored, wherein the computer is programmed with software
for generating an identified sequence value for each of the
transcript sequences, where each said identified sequence
value is indicative of a sequence annotation and a degree
of match between a different one of the biological
sequences of the library and at least one of the reference
transcript sequences, and for processing each said
identified sequence value to generate final data values
indicative of a number of times each identified sequence
value is present in the library.

15. The system of claim 14, also including:

library generation means for producing the library of
biological sequences and generating said set of transcript
sequences from said library.

16. The system of claim 15, wherein the library
generation means includes:

means for obtaining a mixture of mRNA;

means for making cDNA copies of the mRNA;

means for inserting the cDNA copies into cells and
permitting the cells to grow into clones;

means for isolating a representative population of the
clones and producing therefrom the library of biological
sequences.
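
As an illustration only of the counting and ratio steps recited in claims 1(d) and 7(f), the following Python sketch counts how often each identified sequence value occurs in two libraries and forms the per-gene ratio of the counts. It is a sketch under stated assumptions, not the claimed implementation: the function name `transcript_ratios`, the use of infinity where a gene is absent from the second library, and the sample gene names are all hypothetical.

```python
from collections import Counter


def transcript_ratios(first_library, second_library):
    """Return {gene: count_in_first / count_in_second}.

    Each argument is a list of identified sequence values, one per clone.
    A gene absent from the second library gets float("inf"); this
    convention is an assumption made for the sketch.
    """
    first, second = Counter(first_library), Counter(second_library)
    ratios = {}
    for gene in sorted(set(first) | set(second)):
        if second[gene]:
            ratios[gene] = first[gene] / second[gene]
        else:
            ratios[gene] = float("inf")
    return ratios


# Hypothetical identified sequence values for two specimens.
specimen_1 = ["GENE_A", "GENE_A", "GENE_A", "GENE_B", "GENE_C"]
specimen_2 = ["GENE_A", "GENE_B", "GENE_B"]
ratios = transcript_ratios(specimen_1, specimen_2)
for gene, ratio in ratios.items():
    print(gene, ratio)
```

In practice the raw counts would normally be normalized to library size before comparison, as the abundance-per-10,000 calculation in the listings does.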
Quantitative Monitoring of Gene Expression
Patterns with a Complementary DNA Microarray 

Mark Schena,* Dari Shalon/t Ronald W. Davis, 
Patrick O. Brown* 

A high-capacity system was developed to monitor the expression of many genes In 
parallel. Microarrays prepared by high-speed robotic printing of complementary DNAs on 
glass were used for quantitative expression measurements of the corresponding genes 
Because of the small format and high density of the arrays, hybridization volumes of 2 
microlrters could be used that enabled detection of rare transcripts In probe mixtures 
derived from 2 micrograms of total cellular messenger RNA. Differential expression 
measurements of 45 Arabidopsis genes were made by means of simultaneous, two-color 
fluorescence hybridization. w^o«r 



The temporal, developmental, topographi- 
cal, histological, and physiological patterns 
in which a gene is expressed provide clues to 
its biobgical role. The large and expanding 
database of complementary DNA (cDNA) 
sequences from many organisms (J) presents 
the opportunity of defining these patterns at 
the level of the whole genome. 

For these studies, we used th c small flow- 
ering plant Arobidopsis dudiana as a model 
organism. Arabidopsis possesses many ad- 
vantages for gene expression analysis, in- 
cluding the tact that it has the smallest 
genome of any higher eukaryote examined 
to date (2). Forty-five cloned Arabidopiis 
cDNAs (Table 1), including 14 complete 
sequences and 31 expressed sequence tags 
(ESTs), were used as gene-specific targets. 
We obtained the ESTs by selecting cDNA 
clones at random from an Arabidopsis 
cDNA library. Sequence analysis revealed 
that 28 of the 31 ESTs matched sequences 
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in the database (Table 1). Three additional 
cDNAs from other organisms served as con- 
trols in the experiments. 

The 48 cDNAs, averaging -1.0 kb, 
were amplified with the polymerase chain 
reaction (PCR) and deposited into indi- 
vidual wells of a 96-well microtiter plate. 
Each sample was duplicated in two adja- 
cent wells to allow the reproducibility of 
the arraying and hybridization process to 
be tested. Samples from the microtiter 
plate were printed onto glass microscope 
slides in an area measuring 3.5 mm by 5.5 
mm with the use of a high-speed arraying 
machine (3). The arrays were processed by 
chemical and heat treatment to attach the 
DNA sequences to the glass surface and 
denature them (3). Three arrays, printed 
in a single lot, were used for the experi-
ments here. A single microtiter plate of
PCR products provides sufficient material 
to print at least 500 arrays. 
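
The plate layout described above (48 samples, each placed in two adjacent wells of a 96-well plate) can be sketched in a few lines of Python. This is only an illustration: the row letters a to h and the column pairs follow the position labels of Table 1, while the function name, the placeholder sample names, and the row-by-row fill order are assumptions, not details given in the text.

```python
import string


def duplicate_layout(samples, rows=8, cols=12):
    """Assign each sample to two adjacent wells, filling the plate row by row.

    Hypothetical helper: the fill order is an assumption for illustration.
    """
    if len(samples) * 2 != rows * cols:
        raise ValueError("samples must fill the plate exactly when duplicated")
    pairs_per_row = cols // 2
    layout = {}
    for index, name in enumerate(samples):
        row = string.ascii_lowercase[index // pairs_per_row]
        first_col = 2 * (index % pairs_per_row) + 1
        layout[name] = (f"{row}{first_col}", f"{row}{first_col + 1}")
    return layout


# Placeholder names standing in for the 48 amplified cDNAs.
samples = [f"cDNA{i}" for i in range(1, 49)]
layout = duplicate_layout(samples)
print(layout["cDNA1"])   # ('a1', 'a2')
print(layout["cDNA48"])  # ('h11', 'h12')
```

Comparing the two wells of each pair after hybridization is what allows the reproducibility of arraying and hybridization to be checked.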

Fluorescent probes were prepared from 
total Arabidopsis mRNA (4) by a single
round of reverse transcription (5). The Ara- 
bidopsis mRNA was supplemented with hu- 
man acetylcholine receptor (AChR) mRNA 
at a dilution of 1 : 10,000 (w/w) before cDNA 
synthesis, to provide an internal standard for 
calibration (5). The resulting fluorescently 
labeled cDNA mixture was hybridized to an 
array at high stringency (6) and scanned
with a laser (3). A high-sensitivity scan gave
signals that saturated the detector at nearly
all of the Arabidopsis target sites (Fig. 1A).
Calibration relative to the AChR mRNA
standard (Fig. 1A) established a sensitivity
limit of ~1:50,000. No detectable hybridization
was observed to either the rat glucocorticoid
receptor (Fig. 1A) or the yeast TRP4
(Fig. 1A) targets even at the highest scanning
sensitivity. A moderate-sensitivity scan
of the same array allowed linear detection of
the more abundant transcripts (Fig. 1B).
Quantitation of both scans revealed a range
of expression levels spanning three orders of
magnitude for the 45 genes tested (Table 2).
RNA blots (7) for several genes (Fig. 2)
corroborated the expression levels measured
with the microarray to within a factor of 5
(Table 2).

Differential gene expression was investigated
with a simultaneous, two-color hybridization
scheme, which served to minimize
experimental variation inherent in the
comparison of independent hybridizations.
Fluorescent probes were prepared from two
mRNA sources with the use of reverse transcriptase
in the presence of fluorescein- and
lissamine-labeled nucleotide analogs, respectively
(5). The two probes were then
mixed together in equal proportions, hybridized
to a single array, and scanned separately
for fluorescein and lissamine emission
after independent excitation of the two
fluorophores (3).

To test whether overexpression of a single
gene could be detected in a pool of total
Arabidopsis mRNA, we used a microarray to
analyze a transgenic line overexpressing the
single transcription factor HAT4 (8). Fluorescent
probes representing mRNA from
wild-type and HAT4-transgenic plants were
labeled with fluorescein and lissamine, respectively;
the two probes were then mixed
and hybridized to a single array. An intense
hybridization signal was observed at the
position of the HAT4 cDNA in the lissamine-specific
scan (Fig. 1D), but not in the
fluorescein-specific scan of the same array
(Fig. 1C). Calibration with AChR mRNA
added to the fluorescein and lissamine
cDNA synthesis reactions at dilutions of
1:10,000 (Fig. 1C) and 1:100 (Fig. 1D),
respectively, revealed a 50-fold elevation of
HAT4 mRNA in the transgenic line relative
to its abundance in wild-type plants
(Table 2). This magnitude of HAT4 overexpression
matched that inferred from the
Northern (RNA) analysis within a factor of
2 (Fig. 2 and Table 2). Expression of all the
other genes monitored on the array differed
by less than a factor of 5 between HAT4-transgenic
and wild-type plants (Fig. 1, C






Fig. 1. Gene expression monitored with the use of cDNA microarrays. Fluorescent scans represented in
pseudocolor correspond to hybridization intensities. Color bars were calibrated from the signal obtained
with the use of known [illegible]. [Illegible] letters on the axes [illegible]. (A) [Illegible]
with fluorescein-labeled cDNA derived from wild-type plants. (B) Same array as in (A) but scanned at
moderate sensitivity. (C and D) A single array was probed with a 1:1 mixture of fluorescein-labeled cDNA
from wild-type plants and lissamine-labeled cDNA from HAT4-transgenic plants. The single array was
then scanned successively to detect the fluorescein fluorescence corresponding to mRNA from wild-type
plants (C) and the lissamine fluorescence corresponding to mRNA from HAT4-transgenic plants (D). (E
and F) A single array was probed with a 1:1 mixture of fluorescein-labeled cDNA from root tissue and
lissamine-labeled cDNA from leaf tissue. The single array was then scanned successively to detect the
fluorescein fluorescence corresponding to mRNAs expressed in roots (E) and the lissamine fluorescence
corresponding to mRNAs expressed in leaves (F).

[Fig. 2 image: RNA blot panels for CAB1, HAT4, and ROC1, loaded with 1.0, 0.1, and 0.01 µg of mRNA,
and for human AChR, loaded with 20, 2.0, and 0.2 ng of mRNA.]
Fig. 2. Gene expression monitored with RNA 
(Northern) blot analysis. Designated amounts of 
mRNA from wild-type and HAT4-transgenic
plants were spotted onto nylon membranes and
probed with the cDNAs indicated. Purified human
AChR mRNA was used for calibration.



and D, and Table 2). Hybridization of fluorescein-labeled
glucocorticoid receptor
cDNA (Fig. 1C) and lissamine-labeled
TRP4 cDNA (Fig. 1D) verified the presence
of the negative control targets and the
lack of optical cross talk between the two
fluorophores.
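
The fold elevation quoted for HAT4 follows directly from the calibrated w/w abundances. As a minimal sketch, assuming the microarray values listed for HAT4 in Table 2 (1:8300 in wild-type plants and 1:150 in the transgenic line), the arithmetic can be written out in Python. The function names are hypothetical and are not part of the original analysis.

```python
from fractions import Fraction


def parse_abundance(text):
    """Convert a w/w ratio written as '1:8300' into the fraction 1/8300."""
    numerator, denominator = text.split(":")
    return Fraction(int(numerator), int(denominator.replace(",", "")))


def fold_change(reference, sample):
    """Return the sample abundance divided by the reference abundance."""
    return float(parse_abundance(sample) / parse_abundance(reference))


# HAT4 microarray values as read from Table 2 (wild type, then transgenic).
hat4_fold = fold_change("1:8300", "1:150")
print(round(hat4_fold, 1))  # 55.3
```

The result, about 55-fold, is consistent with the roughly 50-fold elevation reported in the text.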

To explore a more complex alteration in 
expression patterns, we performed a second 
two-color hybridization experiment with 
fluorescein- and lissamine-labeled probes
prepared from root and leaf mRNA, respec- 
tively. The scanning sensitivities for the 
two fluorophores were normalized by 
matching the signals resulting from AChR
mRNA, which was added to both cDNA
synthesis reactions at a dilution of 1:1000
(Fig. 1, E and F). A comparison of the scans
revealed widespread differences in gene expression
between root and leaf tissue (Fig. 1,
E and F). The mRNA from the light-regulated
CAB1 gene was ~500-fold more abundant
in leaf (Fig. 1F) than in root tissue
(Fig. 1E). The expression of 26 other genes
differed between root and leaf tissue by
more than a factor of 5 (Fig. 1, E and F).

The HAT4-transgenic line we examined 
has elongated hypocotyls, early flowering, 
poor germination, and altered pigmentation 
(8). Although changes in expression were



Table 1. Sequences contained on the cDNA microarray. Shown is the position, the known or putative
function, and the accession number of each cDNA in the microarray (Fig. 1). All but three of the ESTs used
in this study matched a sequence in the database. NADH, reduced nicotinamide adenine
dinucleotide; ATPase, adenosine triphosphatase.


Position   cDNA     Function                          Accession number

a1, 2      AChR     Human AChR                        *
a3, 4      EST3     Actin                             H36236
a5, 6      EST6     NADH dehydrogenase                Z27010
a7, 8      AAC1     Actin 1                           M20016
a9, 10     EST12    Unknown                           U35594†
a11, 12    EST13    Actin                             T45783
b1, 2      CAB1     Chlorophyll a/b binding           M85150
b3, 4      EST17    Phosphoglycerate kinase           T44490
b5, 6      GA4      Gibberellic acid biosynthesis     L37126
b7, 8      EST19    Unknown                           U36595†
b9, 10     GBF-1    G-box binding factor 1            X63894
b11, 12    EST23    Elongation factor                 X52256
c1, 2      EST29    Aldolase                          T04477
c3, 4      GBF-2    G-box binding factor 2            X63895
c5, 6      EST34    Chloroplast protease              R87034
c7, 8      EST35    Unknown                           T14152
c9, 10     EST41    Catalase                          T22720
c11, 12    rGR      Rat glucocorticoid receptor       M14053
d1, 2      EST42    Unknown                           U36596†
d3, 4      EST45    ATPase                            J04185
d5, 6      HAT1     Homeobox-leucine zipper 1         U09332
d7, 8      EST46    Light harvesting complex          T04063
d9, 10     EST49    Unknown                           T76267
d11, 12    HAT2     Homeobox-leucine zipper 2         U09335
e1, 2      HAT4     Homeobox-leucine zipper 4         M90394
e3, 4      EST50    Phosphoribulokinase               T04344
e5, 6      HAT5     Homeobox-leucine zipper 5         M90416
e7, 8      EST51    Unknown                           Z33675
e9, 10     HAT22    Homeobox-leucine zipper 22        U09336
e11, 12    EST52    Oxygen evolving                   T21749
f1, 2      EST59    Unknown                           Z34607
f3, 4      KNAT1    Knotted-like homeobox 1           U14174
f5, 6      EST60    RuBisCO small subunit             X14564
f7, 8      EST69    Translation elongation factor     T42799
f9, 10     PPH1     Protein phosphatase 1             U34803
f11, 12    EST70    Unknown                           T44621
g1, 2      EST75    Chloroplast protease              T43698
g3, 4      EST78    Unknown                           R65481
g5, 6      ROC1     Cyclophilin                       L14844
g7, 8      EST82    GTP binding                       X59152
g9, 10     EST83    Unknown                           Z33795
g11, 12    EST84    Unknown                           T45278
h1, 2      EST91    Unknown                           T13832
h3, 4      EST96    Unknown                           R64816
h5, 6      SAR1     Synaptobrevin                     M90418
h7, 8      EST100   Light harvesting complex          Z18205
h9, 10     EST103   Light harvesting complex          X03909
h11, 12    TRP4     Yeast tryptophan biosynthesis     X04273

*Proprietary sequence of Stratagene (La Jolla, California).



observed for HAT4, large changes in ex-
pression were not observed for any of the
other 44 genes we examined. This was
somewhat surprising, particularly because 
comparative analysis of leaf and root tissue 
identified 27 differentially expressed genes. 
Analysis of an expanded set of genes may be 
required to identify genes whose expression 
changes upon HAT4 overexpression; alter- 
natively, a comparison of mRNA popula- 
tions from specific tissues of wild-type and 
HAT4-transgenic plants may allow identi- 
fication of downstream genes. 

At the current density of robotic printing, 
it is feasible to scale up the fabrication process
to produce arrays containing 20,000
cDNA targets. At this density, a single array
would be sufficient to provide gene-specific
targets encompassing nearly the entire repertoire
of expressed genes in the Arabidopsis
genome (2). The availability of 20,274 ESTs
from Arabidopsis (1, 9) would provide a rich
source of templates for such studies.

The estimated 100,000 genes in the human
genome (10) exceed the number of
Arabidopsis genes by a factor of 5 (2). This
modest increase in complexity suggests that
similar cDNA microarrays, prepared from
the rapidly growing repertoire of human
ESTs (1), could be used to determine the
expression patterns of tens of thousands of
human genes in diverse cell types. Coupling
an amplification strategy to the reverse
transcription reaction (11) could make it
feasible to monitor expression even in
minute tissue samples. A wide variety of
acute and chronic physiological and pathological
conditions might lead to characteristic
changes in the patterns of gene expression
in peripheral blood cells or other easily
sampled tissues. In concert with cDNA microarrays
for monitoring complex expression
patterns, these tissues might therefore
serve as sensitive in vivo sensors for clinical
diagnosis. Microarrays of cDNAs could thus
provide a useful link between human gene
sequences and clinical medicine.

†No match in the database: novel EST.

Table 2. Gene expression monitoring by microarray
and RNA blot analyses; tg, HAT4-transgenic.
See Table 1 for additional gene information. Expression
levels (w/w) were calibrated with the use
of known amounts of human AChR mRNA. Values
for the microarray were determined from microarray
scans (Fig. 1); values for the RNA blot were
determined from RNA blots (Fig. 2).

Gene         Expression level (w/w)
             Microarray    RNA blot

CAB1         1:48          1:83
CAB1 (tg)    1:120         1:150
HAT4         1:8300        1:6300
HAT4 (tg)    1:150         1:210
ROC1         1:1200        1:1800
ROC1 (tg)    1:260         1:1300



Gene Therapy in Peripheral Blood
Lymphocytes and Bone Marrow for 
ADA Immunodeficient Patients 

Claudio Bordignon,* Luigi D. Notarangelo, Nadia Nobili, 
Giuliana Ferrari, Giulia Casorati, Paola Panina, Evelina Mazzolari, 
Daniela Maggioni, Claudia Rossi, Paolo Servida, 
Alberto G. Ugazio, Fulvio Mavilio 

Adenosine deaminase (ADA) deficiency results in severe combined immunodeficiency,
the first genetic disorder treated by gene therapy. Two different retroviral vectors were 
used to transfer ex vivo the human ADA minigene into bone marrow cells and peripheral 
blood lymphocytes from two patients undergoing exogenous enzyme replacement ther- 
apy. After 2 years of treatment, long-term survival of T and B lymphocytes, marrow cells, 
and granulocytes expressing the transferred ADA gene was demonstrated and resulted 
in normalization of the immune repertoire and restoration of cellular and humoral immunity. 
After discontinuation of treatment, T lymphocytes, derived from transduced peripheral 
blood lymphocytes, were progressively replaced by marrow-derived T cells in both pa-
tients. These results indicate successful gene transfer into long-lasting progenitor cells,
producing a functional multilineage progeny. 



Severe combined immunodeficiency associated
with inherited deficiency of ADA
(1) is usually fatal unless affected children
are kept in protective isolation or the immune
system is reconstituted by bone marrow
transplantation from a human leukocyte
antigen (HLA)-identical sibling donor
(2). This is the therapy of choice, although
it is available only for a minority of patients.
In recent years, other forms of therapy have
been developed, including transplants from
haploidentical donors (3, 4), exogenous enzyme
replacement (5), and somatic-cell
gene therapy (6-9).

We previously reported a preclinical model
in which ADA gene transfer and expression
successfully restored immune functions in human
ADA-deficient (ADA−) peripheral
blood lymphocytes (PBLs) in immunodeficient
mice in vivo (10, 11). On the basis of
these preclinical results, the clinical application
of gene therapy for the treatment of
ADA− SCID (severe combined immunodeficiency
disease) patients who previously failed
exogenous enzyme replacement therapy was
approved by our Institutional Ethical Committees
and by the Italian National Committee
for Bioethics (12). In addition to evaluating
the safety and efficacy of the gene therapy
procedure, the aim of the study was to define
the relative role of PBLs and hematopoietic
stem cells in the long-term reconstitution of
immune functions after retroviral vector-mediated
ADA gene transfer. For this purpose,
two structurally identical vectors expressing
the human ADA complementary DNA
(cDNA), distinguishable by the presence of
alternative restriction sites in a nonfunctional
region of the viral long-terminal repeat
(LTR), were used to transduce PBLs and bone
marrow (BM) cells independently. This procedure
allowed identification of the origin of



METHOD AND APPARATUS FOR FABRICATING
MICROARRAYS OF BIOLOGICAL SAMPLES

Field of the Invention 

This invention relates to a method and apparatus
for fabricating microarrays of biological samples for
large scale screening assays, such as arrays of DNA
samples to be used in DNA hybridization assays for
genetic research and diagnostic applications.

Background of the Invention

A variety of methods are currently available for
making arrays of biological macromolecules, such as
arrays of nucleic acid molecules or proteins. One
method for making ordered arrays of DNA on a porous
membrane is a "dot blot" approach. In this method, a
vacuum manifold transfers a plurality, e.g., 96,
aqueous samples of DNA from 3 millimeter diameter wells
to a porous membrane. A common variant of this
procedure is a "slot-blot" method in which the wells
have highly-elongated oval shapes.

The DNA is immobilized on the porous membrane by
baking the membrane or exposing it to UV radiation.
This is a manual procedure practical for making one
array at a time and usually limited to 96 samples per
array. "Dot-blot" procedures are therefore inadequate
for applications in which many thousand samples must be
determined.

A more efficient technique employed for making
ordered arrays of genomic fragments uses an array of
pins dipped into the wells, e.g., the 96 wells of a
microtitre plate, for transferring an array of samples
to a substrate, such as a porous membrane. One array
includes pins that are designed to spot a membrane in a
staggered fashion, for creating an array of 9216 spots
in a 22 x 22 cm area (Lehrach, et al., 1990). A
limitation with this approach is that the volume of DNA
spotted in each pixel of each array is highly variable.
In addition, the number of arrays that can be made with
each dipping is usually quite small.

An alternate method of creating ordered arrays of
nucleic acid sequences is described by Pirrung, et al.
(1992), and also by Fodor, et al. (1991). The method
involves synthesizing different nucleic acid sequences
at different discrete regions of a support. This
method employs elaborate synthetic schemes, and is
generally limited to relatively short nucleic acid
samples, e.g., less than 20 bases. A related method has
been described by Southern, et al. (1992).

Khrapko, et al. (1991) describes a method of
making an oligonucleotide matrix by spotting DNA onto a
thin layer of polyacrylamide. The spotting is done
manually with a micropipette.

None of the methods or devices described in the
prior art are designed for mass fabrication of
microarrays characterized by (i) a large number of
micro-sized assay regions separated by a distance of
50-200 microns or less, and (ii) a well-defined amount,
typically in the picomole range, of analyte associated
with each region of the array.

Furthermore, current technology is directed at
performing such assays one at a time to a single array
of DNA molecules. For example, the most common method
for performing DNA hybridizations to arrays spotted
onto porous membrane involves sealing the membrane in a
plastic bag (Maniatis, et al., 1989) or a rotating
glass cylinder (Robbins Scientific) with the labeled
hybridization probe inside the sealed chamber. For
arrays made on non-porous surfaces, such as a
microscope slide, each array is incubated with the
labeled hybridization probe sealed under a coverslip.
These techniques require a separate sealed chamber for
each array, which makes the screening and handling of
many such arrays inconvenient and time intensive.

Abouzied, et al. (1994) describes a method of
printing horizontal lines of antibodies on a
nitrocellulose membrane and separating regions of the
membrane with vertical stripes of a hydrophobic
material. Each vertical stripe is then reacted with a
different antigen and the reaction between the
immobilized antibody and an antigen is detected using a
standard ELISA colorimetric technique. Abouzied's
technique makes it possible to screen many one-dimensional
arrays simultaneously on a single sheet of
nitrocellulose. Abouzied makes the nitrocellulose
somewhat hydrophobic using a line drawn with PAP Pen
(Research Products International). However, Abouzied
does not describe a technology that is capable of
completely sealing the pores of the nitrocellulose. The
pores of the nitrocellulose are still physically open
and so the assay reagents can leak through the
hydrophobic barrier during extended high temperature
incubations or in the presence of detergents, which
makes the Abouzied technique unacceptable for DNA
hybridization assays.

Porous membranes with printed patterns of
hydrophilic/hydrophobic regions exist for applications
such as ordered arrays of bacteria colonies. QA Life
Sciences (San Diego CA) makes such a membrane with a
grid pattern printed on it. However, this membrane has
the same disadvantage as the Abouzied technique since
reagents can still flow between the gridded arrays,
making them unusable for separate DNA hybridization
assays.

Pall Corporation makes a 96-well plate with a
porous filter heat sealed to the bottom of the plate.
These plates are capable of containing different
reagents in each well without cross-contamination.
However, each well is intended to hold only one target
element whereas the invention described here makes a
microarray of many biomolecules in each subdivided
region of the solid support. Furthermore, the 96 well
plates are at least 1 cm thick and prevent the use of
the device for many colorimetric, fluorescent and
radioactive detection formats which require that the
membrane lie flat against the detection surface. The
invention described here requires no further processing
after the assay step since the barrier elements are
shallow and do not interfere with the detection step,
thereby greatly increasing convenience.

Hyseq Corporation has described a method of making
an "array of arrays" on a non-porous solid support for
use with their sequencing by hybridization technique.
The method described by Hyseq involves modifying the
chemistry of the solid support material to form a
hydrophobic grid pattern where each subdivided region
contains a microarray of biomolecules. Hyseq's flat
hydrophobic pattern does not make use of physical
blocking as an additional means of preventing cross
contamination.

Summary of the Invention

The invention includes, in one aspect, a method of
forming a microarray of analyte-assay regions on a
solid support, where each region in the array has a
known amount of a selected, analyte-specific reagent.
The method involves first loading a solution of a
selected analyte-specific reagent in a reagent-dispensing
device having an elongate capillary channel
(i) formed by spaced-apart, coextensive elongate
members, (ii) adapted to hold a quantity of the reagent
solution and (iii) having a tip region at which aqueous
solution in the channel forms a meniscus. The channel
is preferably formed by a pair of spaced-apart tapered
elements.

The tip of the dispensing device is tapped against
a solid support at a defined position on the support
surface with an impulse effective to break the meniscus
in the capillary channel and deposit a selected volume of
solution on the surface, preferably a selected volume
in the range 0.01 to 100 nl. The two steps are
repeated until the desired array is formed.

The method may be practiced in forming a plurality
of such arrays, where the solution-depositing step is
applied to a selected position on each of a
plurality of solid supports at each repeat cycle.

The dispensing device may be loaded with a new
solution, by the steps of (i) dipping the capillary
channel of the device in a wash solution, (ii) removing
wash solution drawn into the capillary channel, and
(iii) dipping the capillary channel into the new
reagent solution.

Also included in the invention is an automated
apparatus for forming a microarray of analyte-assay
regions on a plurality of solid supports, where each
region in the array has a known amount of a selected,
analyte-specific reagent. The apparatus has a holder
for holding, at known positions, a plurality of planar
supports, and a reagent dispensing device of the type
described above.

The apparatus further includes positioning
structure for positioning the dispensing device at a
selected array position with respect to a support in
said holder, and dispensing structure for moving the
dispensing device into tapping engagement against a
support with a selected impulse effective to deposit a
selected volume on the support, e.g., a selected volume
in the volume range 0.01 to 100 nl.

The positioning and dispensing structures are
controlled by a control unit in the apparatus. The
unit operates to (i) place the dispensing device at a
loading station, (ii) move the capillary channel in the
device into a selected reagent at the loading station,
to load the dispensing device with the reagent, and
(iii) dispense the reagent at a defined array position
on each of the supports on said holder. The unit may
further operate, at the end of a dispensing cycle, to
wash the dispensing device by (i) placing the
dispensing device at a washing station, (ii) moving the
capillary channel in the device into a wash fluid, to
load the dispensing device with the fluid, and (iii)
removing the wash fluid prior to loading the dispensing
device with a fresh selected reagent.
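
The order of operations attributed to the control unit can be sketched as a simple step list in Python. This is a hedged illustration only: the step names, function names, and the choice to give each reagent the next array position are assumptions introduced here, and no real positioning hardware or volumes are modeled.

```python
def dispensing_cycle(reagent, array_position, supports):
    """List the control-unit steps for one reagent, in the order described.

    All step names are hypothetical labels, not terms from the apparatus.
    """
    steps = [
        ("move_to_loading_station", reagent),
        ("load_capillary", reagent),
    ]
    # The same array position is printed on every support in the holder.
    for support in supports:
        steps.append(("dispense", support, array_position))
    # Wash sequence at the end of the dispensing cycle.
    steps += [
        ("move_to_washing_station",),
        ("load_wash_fluid",),
        ("remove_wash_fluid",),
    ]
    return steps


def print_arrays(reagents, supports):
    """Run one cycle per reagent; each reagent gets its own array position."""
    log = []
    for position, reagent in enumerate(reagents):
        log.extend(dispensing_cycle(reagent, position, supports))
    return log


log = print_arrays(["reagent_A", "reagent_B"], ["slide_1", "slide_2", "slide_3"])
for step in log:
    print(step)
```

Because one loading serves every support in the holder, the number of load and wash operations grows with the number of reagents rather than with the number of arrays produced.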

The dispensing device in the apparatus may be one
of a plurality of such devices which are carried on the
arm for dispensing different analyte assay reagents at
selected spaced array positions.

In another aspect, the invention includes a
substrate with a surface having a microarray of at
least 10³ distinct polynucleotide or polypeptide
biopolymers in a surface area of less than about 1 cm².
Each distinct biopolymer (i) is disposed at a separate,
defined position in said array, (ii) has a length of at
least 50 subunits, and (iii) is present in a defined
amount between about 0.1 femtomoles and 100 nanomoles.

In one embodiment, the surface is a glass slide
surface coated with a polycationic polymer, such as
polylysine, and the biopolymers are polynucleotides.
In another embodiment, the substrate has a water-impermeable
backing, a water-permeable film formed on
the backing, and a grid formed on the film. The grid
is composed of intersecting water-impervious grid
elements extending from said backing to positions
raised above the surface of said film, and partitions
the film into a plurality of water-impervious cells. A
biopolymer array is formed within each well.

More generally, there is provided a substrate for
use in detecting binding of labeled polynucleotides to
one or more of a plurality of different-sequence,
immobilized polynucleotides. The substrate includes,
in one aspect, a glass support, a coating of a
polycationic polymer, such as polylysine, on said
surface of the support, and an array of distinct
polynucleotides electrostatically bound non-covalently
to said coating, where each distinct biopolymer is
disposed at a separate, defined position in a surface
array of polynucleotides.

In another aspect, the substrate includes a water-impermeable
backing, a water-permeable film formed on
the backing, and a grid formed on the film, where the
grid is composed of intersecting water-impervious grid
elements extending from the backing to positions raised
above the surface of the film, forming a plurality of
cells. A biopolymer array is formed within each cell.

Also forming part of the invention is a method of
detecting differential expression of each of a
plurality of genes in a first cell type, with respect
to expression of the same genes in a second cell type.
In practicing the method, there are first produced
fluorescent-labeled cDNA's from mRNA's isolated from
the two cell types, where the cDNA's from the first
and second cells are labeled with first and second
different fluorescent reporters.

A mixture of the labeled cDNA's from the two cell
types is added to an array of polynucleotides
representing a plurality of known genes derived from
the two cell types, under conditions that result in
hybridization of the cDNA's to complementary-sequence
polynucleotides in the array. The array is then
examined by fluorescence under fluorescence excitation
conditions in which (i) polynucleotides in the array
that are hybridized predominantly to cDNA's derived
from one of the first and second cell types give a
distinct first or second fluorescence emission color,
respectively, and (ii) polynucleotides in the array
that are hybridized to substantially equal numbers of
cDNA's derived from the first and second cell types
give a distinct combined fluorescence emission color,
respectively. The relative expression of known genes
in the two cell types can then be determined by the
observed fluorescence emission color of each spot.
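
The color-based readout described above can be illustrated with a small Python sketch that labels a spot from its two channel intensities. The function name, the example intensities, and the ratio threshold `balance` are all assumptions chosen for illustration; the text itself does not specify a numerical cutoff for "predominantly" or "substantially equal".

```python
def classify_spot(first_signal, second_signal, balance=2.0):
    """Label a spot by which reporter dominates its fluorescence.

    `balance` is a hypothetical cutoff: a channel is called predominant
    when it is at least `balance` times the other. Signals are assumed
    to be non-negative, background-corrected intensities.
    """
    if first_signal <= 0 and second_signal <= 0:
        return "no signal"
    if second_signal == 0 or first_signal / second_signal >= balance:
        return "first color"
    if first_signal == 0 or second_signal / first_signal >= balance:
        return "second color"
    return "combined color"


# Invented intensities for three spots (first channel, second channel).
spots = {"gene_1": (900, 100), "gene_2": (100, 900), "gene_3": (500, 450)}
for name, (first, second) in spots.items():
    print(name, classify_spot(first, second))
```

A spot labeled "first color" or "second color" corresponds to a gene expressed predominantly in one cell type, while "combined color" corresponds to roughly equal expression in both.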

These and other objects and features of the
invention will become more fully apparent when the
following detailed description of the invention is read
in conjunction with the accompanying figures.

Brief Description of the Drawings
Fig. 1 is a side view of a reagent-dispensing
device having an open-capillary dispensing head
constructed for use in one embodiment of the invention;

Figs. 2A-2C illustrate steps in the delivery of a
fixed-volume bead on a hydrophobic surface employing
the dispensing head from Fig. 1, in accordance with one
embodiment of the method of the invention;

Fig. 3 shows a portion of a two-dimensional array
of analyte-assay regions constructed according to the
method of the invention;

Fig. 4 is a planar view showing components of an
automated apparatus for forming arrays in accordance
with the invention.



WO 95/35505 



PCT/US95/07659 



Fig. 5 shows a fluorescent image of an actual 20 x
20 array of 400 fluorescently-labeled DNA samples
immobilized on a poly-l-lysine coated slide, where the
total area covered by the 400 element array is 16
square millimeters;

Fig. 6 is a fluorescent image of a 1.8 cm x 1.8 cm
microarray containing lambda clones with yeast inserts,
the fluorescent signal arising from the hybridization
to the array with approximately half the yeast genome
labeled with a green fluorophore and the other half
with a red fluorophore;

Fig. 7 shows the translation of the hybridization
image of Fig. 6 into a karyotype of the yeast genome,
where the elements of the Fig. 6 microarray contain yeast
DNA sequences that have been previously physically
mapped in the yeast genome;

Fig. 8 shows a fluorescent image of a 0.5 cm x 0.5
cm microarray of 24 cDNA clones, where the microarray
was hybridized simultaneously with total cDNA from wild
type Arabidopsis plant labeled with a green fluorophore
and total cDNA from a transgenic Arabidopsis plant
labeled with a red fluorophore, and the arrow points to
the cDNA clone representing the gene introduced into
the transgenic Arabidopsis plant;

Fig. 9 shows a plan view of a substrate having an
array of cells formed by barrier elements in the form
of a grid;

Fig. 10 shows an enlarged plan view of one of the
cells in the substrate in Fig. 9, showing an array of
polynucleotide regions in the cell;

Fig. 11 is an enlarged sectional view of the 
substrate in Fig. 9, taken along a section line in that 
figure; and 

Fig. 12 is a scanned image of a 3 cm x 3 cm
nitrocellulose solid support containing four identical
arrays of M13 clones in each of four quadrants, where
each quadrant was hybridized simultaneously to a
different oligonucleotide using an open face
hybridization method.

Detailed Description of the Invention 

I. Definitions 

Unless indicated otherwise, the terms defined 
below have the following meanings: 

"Ligand" refers to one member of a ligand/anti-ligand 
binding pair. The ligand may be, for example, 
one of the nucleic acid strands in a complementary, 
hybridized nucleic acid duplex binding pair; an 
effector molecule in an effector/receptor binding pair; 
or an antigen in an antigen/antibody or 
antigen/antibody fragment binding pair. 

"Antiligand" refers to the opposite member of a 
ligand/anti-ligand binding pair. The antiligand may be 
the other of the nucleic acid strands in a 
complementary, hybridized nucleic acid duplex binding 
pair; the receptor molecule in an effector/receptor 
binding pair; or an antibody or antibody fragment 
molecule in an antigen/antibody or antigen/antibody 
fragment binding pair, respectively. 

"Analyte" or "analyte molecule" refers to a 
molecule, typically a macromolecule, such as a 
polynucleotide or polypeptide, whose presence, amount, 
and/or identity are to be determined. The analyte is 
one member of a ligand/anti-ligand pair. 

"Analyte-specific assay reagent" refers to a 
molecule effective to bind specifically to an analyte 
molecule. The reagent is the opposite member of a 
ligand/anti-ligand binding pair. 

An "array of regions on a solid support" is a 
linear or two-dimensional array of preferably discrete 
regions, each having a finite area, formed on the 
surface of a solid support. 

A "microarray" is an array of regions having a 
density of discrete regions of at least about 100/cm², 
and preferably at least about 1000/cm². The regions in 
a microarray have typical dimensions, e.g., diameters, 
in the range of between about 10-250 µm, and are 
separated from other regions in the array by about the 
same distance. 

A support surface is "hydrophobic" if an aqueous-medium 
droplet applied to the surface does not spread 
out substantially beyond the area size of the applied 
droplet. That is, the surface acts to prevent 
spreading of the droplet applied to the surface by 
hydrophobic interaction with the droplet. 

A "meniscus" means a concave or convex surface 
that forms on the bottom of a liquid in a channel as a 
result of the surface tension of the liquid. 

"Distinct biopolymers", as applied to the 
biopolymers forming a microarray, means an array member 
which is distinct from other array members on the basis 
of a different biopolymer sequence, and/or different 
concentrations of the same or distinct biopolymers, 
and/or different mixtures of distinct or 
different-concentration biopolymers. Thus an array of "distinct 
polynucleotides" means an array containing, as its 
members, (i) distinct polynucleotides, which may have a 
defined amount in each member, (ii) different, graded 
concentrations of given-sequence polynucleotides, 
and/or (iii) different-composition mixtures of two or 
more distinct polynucleotides. 

"Cell type" means a cell from a given source, 
e.g., a tissue, or organ, or a cell in a given state of 
differentiation, or a cell associated with a given 
pathology or genetic makeup. 

II. Method of Microarray Formation 

This section describes a method of forming a 
microarray of analyte-assay regions on a solid support 
or substrate, where each region in the array has a 
known amount of a selected, analyte-specific reagent. 

Fig. 1 illustrates, in a partially schematic view, 
a reagent-dispensing device 10 useful in practicing the 
method. The device generally includes a reagent 
dispenser 12 having an elongate open capillary channel 
14 adapted to hold a quantity of the reagent solution, 
such as indicated at 16, as will be described below. 
The capillary channel is formed by a pair of spaced-apart, 
coextensive, elongate members 12a, 12b which are 
tapered toward one another and converge at a tip or tip 
region 18 at the lower end of the channel. More 
generally, the open channel is formed by at least two 
elongate, spaced-apart members adapted to hold a 
quantity of reagent solutions and having a tip region 
at which aqueous solution in the channel forms a 
meniscus, such as the concave meniscus illustrated at 
20 in Fig. 2A. The advantages of the open channel 
construction of the dispenser are discussed below. 

With continued reference to Fig. 1, the dispenser 
device also includes structure for moving the dispenser 
rapidly toward and away from a support surface, for 
effecting deposition of a known amount of solution in 
the dispenser on a support, as will be described below 
with reference to Figs. 2A-2C. In the embodiment 
shown, this structure includes a solenoid 22 which is 
activatable to draw a solenoid piston 24 rapidly 
downwardly, then release the piston, e.g., under spring 
bias, to a normal, raised position, as shown. The 
dispenser is carried on the piston by a connecting 
member 26, as shown. The just-described moving 
structure is also referred to herein as dispensing 
means for moving the dispenser into engagement with a 
solid support, for dispensing a known volume of fluid 
on the support. 

The dispensing device just described is carried on 
an arm 28 that may be moved either linearly or in an x-y 
plane to position the dispenser at a selected 
deposition position, as will be described. 

Figs. 2A-2C illustrate the method of depositing a 
known amount of reagent solution in the just-described 
dispenser on the surface of a solid support, such as 
the support indicated at 30. The support is a polymer, 
glass, or other solid-material support having a surface 
indicated at 31. 

In one general embodiment, the surface is a 
relatively hydrophilic, i.e., wettable surface, such as 
a surface having native, bound or covalently attached 
charged groups. One such surface described below is a 
glass surface having an absorbed layer of a 
polycationic polymer, such as poly-l-lysine. 

In another embodiment, the surface has or is 
formed to have a relatively hydrophobic character, 
i.e., one that causes aqueous medium deposited on the 
surface to bead. A variety of known hydrophobic 
polymers, such as polystyrene, polypropylene, or 
polyethylene, have desired hydrophobic properties, as do 
glass and a variety of lubricant or other hydrophobic 
films that may be applied to the support surface. 

Initially, the dispenser is loaded with a selected 
analyte-specific reagent solution, such as by dipping 
the dispenser tip, after washing, into a solution of 
the reagent, and allowing filling by capillary flow 
into the dispenser channel. The dispenser is now moved 
to a selected position with respect to a support 
surface, placing the dispenser tip directly above the 
support-surface position at which the reagent is to be 
deposited. This movement takes place with the 
dispenser tip in its raised position, as seen in Fig. 
2A, where the tip is typically at least several mm 
(e.g., 1-5 mm) above the surface of the substrate. 

With the dispenser so positioned, solenoid 22 is 
now activated to cause the dispenser tip to move 
rapidly toward and away from the substrate surface, 
making momentary contact with the surface, in effect, 
tapping the tip of the dispenser against the support 
surface. The tapping movement of the tip against the 
surface acts to break the liquid meniscus in the tip 
channel, bringing the liquid in the tip into contact 
with the support surface. This, in turn, produces a 
flowing of the liquid into the capillary space between 
the tip and the surface, acting to draw liquid out of 
the dispenser channel, as seen in Fig. 2B. 

Fig. 2C shows flow of fluid from the tip onto the 
support surface, which in this case is a hydrophobic 
surface. The figure illustrates that liquid continues 
to flow from the dispenser onto the support surface 
until it forms a liquid bead 32. At a given bead size, 
i.e., volume, the tendency of liquid to flow onto the 
surface will be balanced by the hydrophobic surface 
interaction of the bead with the support surface, which 
acts to limit the total bead area on the surface, and 
by the surface tension of the droplet, which tends 
toward a given bead curvature. At this point, a given 
bead volume will have formed, and continued contact of 
the dispenser tip with the bead, as the dispenser tip 
is being withdrawn, will have little or no effect on 
bead volume. 

For liquid-dispensing on a more hydrophilic 
surface, the liquid will have less of a tendency to 
bead, and the dispensed volume will be more sensitive 
to the total dwell time of the dispenser tip in the 
immediate vicinity of the support surface, e.g., the 
positions illustrated in Figs. 2B and 2C. 

The desired deposition volume, i.e., bead volume, 
formed by this method is preferably in the range 2 pl 
(picoliters) to 2 nl (nanoliters), although volumes as 
high as 100 nl or more may be dispensed. It will be 
appreciated that the selected dispensed volume will 
depend on (i) the "footprint" of the dispenser tip, 
i.e., the size of the area spanned by the tip, (ii) the 
hydrophobicity of the support surface, and (iii) the 
time of contact with and rate of withdrawal of the tip 
from the support surface. In addition, bead size may 
be reduced by increasing the viscosity of the medium, 
effectively reducing the flow time of liquid from the 
dispenser onto the support surface. The drop size may 
be further constrained by depositing the drop in a 
hydrophilic region surrounded by a hydrophobic grid 
pattern on the support surface. 

In a typical embodiment, the dispenser tip is 
tapped rapidly against the support surface, with a 
total residence time in contact with the support of 
less than about 1 msec, and a rate of upward travel 
from the surface of about 10 cm/sec. 

Assuming that the bead that forms on contact with 
the surface is a hemispherical bead, with a diameter 
approximately equal to the width of the dispenser tip, 
as shown in Fig. 2C, the volume of the bead formed in 
relation to dispenser tip width (d) is given in Table 1 
below. As seen, the volume of the bead ranges between 
2 pl and 2 nl as the width size is increased from about 
20 to 200 µm. 

Table 1 

d            Volume (nl) 

20 µm        2 x 10⁻³ 
50 µm        3.1 x 10⁻² 
100 µm       2.5 x 10⁻¹ 
200 µm       2 
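The hemispherical-bead assumption behind Table 1 can be checked with a short calculation. The following is a minimal sketch, not part of the original disclosure: the function name `hemispherical_bead_volume_nl` is hypothetical, and the only inputs taken from the text are the tip widths and the assumption that the bead diameter equals the tip width. A hemisphere of diameter d has volume (2/3)π(d/2)³, and 1 nl equals 10⁶ cubic micrometers.

```python
import math


def hemispherical_bead_volume_nl(tip_width_um: float) -> float:
    """Return the volume, in nanoliters, of a hemispherical bead whose
    diameter equals the dispenser tip width (given in micrometers).

    Hypothetical helper for illustration only.
    """
    radius_um = tip_width_um / 2.0
    # Hemisphere volume = (2/3) * pi * r^3, in cubic micrometers.
    volume_um3 = (2.0 / 3.0) * math.pi * radius_um ** 3
    # 1 nl = 1e-12 m^3 = 1e6 um^3, so divide by 1e6.
    return volume_um3 / 1e6


if __name__ == "__main__":
    for d in (20, 50, 100, 200):
        print(f"d = {d:>3} um -> {hemispherical_bead_volume_nl(d):.2e} nl")
```

The purely geometric values fall close to, though not always exactly on, the tabulated figures, which is consistent with the table being rounded.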



At a given tip size, bead volume can be reduced in 
a controlled fashion by increasing surface 
hydrophobicity, reducing time of contact of the tip 
with the surface, increasing rate of movement of the 
tip away from the surface, and/or increasing the 
viscosity of the medium. Once these parameters are 
fixed, a selected deposition volume in the desired pl 
to nl range can be achieved in a repeatable fashion. 

After depositing a bead at one selected location 
on a support, the tip is typically moved to a 
corresponding position on a second support, a droplet 
is deposited at that position, and this process is 
repeated until a liquid droplet of the reagent has been 
deposited at a selected position on each of a plurality 
of supports. 

The tip is then washed to remove the reagent 
liquid, filled with another reagent liquid, and this 
reagent is now deposited at another array position 
on each of the supports. In one embodiment, the tip is 
washed and refilled by the steps of (i) dipping the 
capillary channel of the device in a wash solution, 
(ii) removing wash solution drawn into the capillary 
channel, and (iii) dipping the capillary channel into 
the new reagent solution. 

From the foregoing, it will be appreciated that 
the tweezers-like, open-capillary dispenser tip 
provides the advantages that (i) the open channel of 
the tip facilitates rapid, efficient washing and drying 
before reloading the tip with a new reagent, (ii) 
passive capillary action can load the sample directly 
from a standard microwell plate while retaining 
sufficient sample in the open capillary reservoir for 
the printing of numerous arrays, (iii) open capillaries 
are less prone to clogging than closed capillaries, and 
(iv) open capillaries do not require a perfectly faced 
bottom surface for fluid delivery. 

A portion of a microarray 36 formed on the surface 
38 of a solid support 40 in accordance with the method 
just described is shown in Fig. 3. The array is formed 
of a plurality of analyte-specific reagent regions, 
such as regions 42, where each region may include a 
different analyte-specific reagent. As indicated 
above, the diameter of each region is preferably 
between about 20-200 µm. The spacing between each 
region and its closest (non-diagonal) neighbor, 
measured from center-to-center (indicated at 44), is 
preferably in the range of about 20-400 µm. Thus, for 
example, an array having a center-to-center spacing of 
about 250 µm contains about 40 regions/cm or 1,600 
regions/cm². After formation of the array, the support 
is treated to evaporate the liquid of the droplet 
forming each region, to leave a desired array of dried, 
relatively flat regions. This drying may be done by 
heating or under vacuum. 
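The relation between center-to-center spacing and region density quoted above follows from simple arithmetic on a square grid. The sketch below is illustrative only; the function names are hypothetical, and it assumes a regular square lattice with equal spacing in both directions, which the text implies but does not state as a requirement.

```python
def regions_per_cm(pitch_um: float) -> float:
    """Linear density of regions for a given center-to-center
    spacing (pitch) in micrometers. 1 cm = 10,000 um."""
    return 10_000.0 / pitch_um


def regions_per_cm2(pitch_um: float) -> float:
    """Areal density, assuming the same pitch along both axes
    of a square grid (an assumption made for this sketch)."""
    return regions_per_cm(pitch_um) ** 2


if __name__ == "__main__":
    pitch = 250.0  # um, the example spacing given in the text
    print(regions_per_cm(pitch), "regions/cm")
    print(regions_per_cm2(pitch), "regions/cm^2")
```

With a 250 µm pitch this reproduces the 40 regions/cm and 1,600 regions/cm² figures stated in the text.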

In some cases, it is desired to first rehydrate 
the droplets containing the analyte reagents to allow 
for more time for adsorption to the solid support. It 
is also possible to spot out the analyte reagents in a 
humid environment so that droplets do not dry until the 
arraying operation is complete. 

III. Automated Apparatus for Forming Arrays 

In another aspect, the invention includes an 
automated apparatus for forming an array of analyte-assay 
regions on a solid support, where each region in 
the array has a known amount of a selected, 
analyte-specific reagent. 

The apparatus is shown in a planar, partially 
schematic view in Fig. 4. A dispenser device 72 in the 
apparatus has the basic construction described above 
with respect to Fig. 1, and includes a dispenser 74 
having an open-capillary channel terminating at a tip, 
substantially as shown in Figs. 1 and 2A-2C. 

The dispenser is mounted in the device for 
movement toward and away from a dispensing position at 
which the tip of the dispenser taps a support surface, 
to dispense a selected volume of reagent solution, as 
described above. This movement is effected by a 
solenoid 76 as described above. Solenoid 76 is under 
the control of a control unit 77 whose operation will 
be described below. The solenoid is also referred to 
herein as dispensing means for moving the device into 
tapping engagement with a support, when the device is 
positioned at a defined array position with respect to 
that support. 

The dispenser device is carried on an arm 74 which 
is threadedly mounted on a worm screw 80 driven 
(rotated) in a desired direction by a stepper motor 82 
also under the control of unit 77. At its left end in 
the figure, screw 80 is carried in a sleeve 84 for 
rotation about the screw axis. At its other end, the 
screw is mounted to the drive shaft of the stepper 
motor, which in turn is carried on a sleeve 86. The 
dispenser device, worm screw, the two sleeves mounting 
the worm screw, and the stepper motor used in moving 
the device in the "x" (horizontal) direction in the 
figure form what is referred to here collectively as a 
displacement assembly 86. 

The displacement assembly is constructed to 
produce precise, micro-range movement in the direction 
of the screw, i.e., along an x axis in the figure. In 
one mode, the assembly functions to move the dispenser 
in x-axis increments having a selected distance in the 
range 5-25 µm. In another mode, the dispenser unit may 
be moved in precise x-axis increments of several 
microns or more, for positioning the dispenser at 
associated positions on adjacent supports, as will be 
described below. 

The displacement assembly, in turn, is mounted for 
movement in the "y" (vertical) axis of the figure, for 
positioning the dispenser at a selected y axis 
position. The structure mounting the assembly includes 
a fixed rod 88 mounted rigidly between a pair of frame 
bars 90, 92, and a worm screw 94 mounted for rotation 
between a pair of frame bars 96, 98. The worm screw is 
driven (rotated) by a stepper motor 100 which operates 
under the control of unit 77. The motor is mounted on 
bar 96, as shown. 

The structure just described, including worm screw 
94 and motor 100, is constructed to produce precise, 
micro-range movement in the direction of the screw, 
i.e., along a y axis in the figure. As above, the 
structure functions in one mode to move the dispenser 
in y-axis increments having a selected distance in the 
range 5-250 µm, and in a second mode, to move the 
dispenser in precise y-axis increments of several 
microns (µm) or more, for positioning the dispenser at 
associated positions on adjacent supports. 

The displacement assembly and structure for moving 
this assembly in the y axis are referred to herein 
collectively as positioning means for positioning the 
dispensing device at a selected array position with 
respect to a support. 

A holder 102 in the apparatus functions to hold a 
plurality of supports, such as supports 104, on which 
the microarrays of reagent regions are to be formed by 
the apparatus. The holder provides a number of 
recessed slots, such as slot 106, which receive the 
supports, and position them at precise selected 
positions with respect to the frame bars on which the 
dispenser moving means is mounted. 

As noted above, the control unit in the device 
functions to actuate the two stepper motors and 
dispenser solenoid in a sequence designed for automated 
operation of the apparatus in forming a selected 
microarray of reagent regions on each of a plurality of 
supports. 

The control unit is constructed, according to 
conventional microprocessor control principles, to 
provide appropriate signals to the solenoid and 
each of the stepper motors, in a given timed sequence 
and for appropriate signalling time. The construction 
of the unit, and the settings that are selected by the 
user to achieve a desired array pattern, will be 
understood from the following description of a typical 
apparatus operation. 

Initially, one or more supports are placed in one 
or more slots in the holder. The dispenser is then 
moved to a position directly above a well (not shown) 
containing a solution of the first reagent to be 
dispensed on the support(s). The dispenser solenoid is 
actuated now to lower the dispenser tip into this well, 
causing the capillary channel in the dispenser to fill. 
Motors 82, 100 are now actuated to position the 
dispenser at a selected array position at the first of 
the supports. Solenoid actuation of the dispenser is 
then effective to dispense a selected-volume droplet of 
that reagent at this location. As noted above, this 
operation is effective to dispense a selected volume 
preferably between 2 pl and 2 nl of the reagent 
solution. 

The dispenser is now moved to the corresponding 
position at an adjacent support and a similar volume of 
the solution is dispensed at this position. The 
process is repeated until the reagent has been 
dispensed at this preselected corresponding position on 
each of the supports. 

Where it is desired to dispense a single reagent 
at more than two array positions on a support, the 
dispenser may be moved to different array positions at 
each support, before moving the dispenser to a new 
support, or solution can be dispensed at individual 
positions on each support, at one selected position, 
then the cycle repeated for each new array position. 

To dispense the next reagent, the dispenser is 
positioned over a wash solution (not shown), and the 
dispenser tip is dipped in and out of this solution 
until the reagent solution has been substantially 
washed from the tip. Solution can be removed from the 
tip, after each dipping, by vacuum, compressed air 
spray, sponge, or the like. 

The dispenser tip is now dipped in a second 
reagent well, and the filled tip is moved to a second 
selected array position in the first support. The 
process of dispensing reagent at each of the 
corresponding second-array positions is then carried out as 
above. This process is repeated until an entire 
microarray of reagent solutions on each of the supports 
has been formed. 
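The operating sequence described above (load a reagent, dispense it at the corresponding position on every support, wash, then repeat with the next reagent) can be summarized as a nested loop. The following is a minimal sketch under stated assumptions: the function `print_schedule`, the step names "load", "dispense" and "wash", and the use of integer grid coordinates are all hypothetical and are not taken from the control unit described in the text. It only models the ordering of steps, not motor or solenoid timing.

```python
from typing import Iterator, Optional, Sequence, Tuple

Position = Tuple[int, int]
Step = Tuple[str, str, Optional[int], Optional[Position]]


def print_schedule(reagents: Sequence[str],
                   positions: Sequence[Position],
                   n_supports: int) -> Iterator[Step]:
    """Yield the ordered steps for printing one array position per reagent
    on each of n_supports supports.

    Each step is (action, reagent, support_index, array_position).
    """
    if len(reagents) != len(positions):
        raise ValueError("each reagent needs exactly one array position")
    for reagent, position in zip(reagents, positions):
        # Fill the capillary channel by dipping the tip in the reagent well.
        yield ("load", reagent, None, None)
        # Deposit at the same array position on every support in turn.
        for support in range(n_supports):
            yield ("dispense", reagent, support, position)
        # Wash the tip before the next reagent is loaded.
        yield ("wash", reagent, None, None)


if __name__ == "__main__":
    for step in print_schedule(["reagent_1", "reagent_2"], [(0, 0), (0, 1)], 3):
        print(step)
```

Visiting every support before washing keeps the number of wash cycles equal to the number of reagents, which is the ordering the text describes.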

IV. Microarray Substrate 

This section describes embodiments of a substrate 
having a microarray of biological polymers carried on 
the substrate surface. Subsection A describes a 
multi-cell substrate, each cell of which contains a 
microarray, and preferably an identical microarray, of 
distinct biopolymers, such as distinct polynucleotides, 
formed on a porous surface. Subsection B describes a 
microarray of distinct polynucleotides bound on a glass 
slide coated with a polycationic polymer. 

A. Multi-Cell Substrate 

Fig. 9 illustrates, in plan view, a substrate 110 
constructed according to the invention. The substrate 
has an 8 x 12 rectangular array 112 of cells, such as 
cells 114, 116, formed on the substrate surface. With 
reference to Fig. 10, each cell, such as cell 114, in 
turn supports a microarray 118 of distinct biopolymers, 
such as polypeptides or polynucleotides, at known, 
addressable regions of the microarray. Two such 
regions forming the microarray are indicated at 120, 
and correspond to regions, such as regions 42, forming 
the microarray of distinct biopolymers shown in Fig. 3. 

The 96-cell array shown in Fig. 9 typically has 
array dimensions between about 12 and 244 mm in width 
and 8 and 400 mm in length, with the cells in the array 
having width and length dimensions of 1/12 and 1/8 the 
array width and length dimensions, respectively, i.e., 
between about 1 and 20 mm in width and 1 and 50 mm in 
length. 
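The cell dimensions follow directly from dividing the overall array dimensions by the number of cells along each side. As a rough illustration, and assuming the grid lines themselves take up negligible width (an assumption not stated in the text), a hypothetical helper might look like this:

```python
from typing import Tuple


def cell_dimensions_mm(array_width_mm: float,
                       array_length_mm: float,
                       cells_across_width: int = 12,
                       cells_along_length: int = 8) -> Tuple[float, float]:
    """Return (cell_width_mm, cell_length_mm) for a rectangular array of
    cells, ignoring the width of the barrier grid lines.

    The defaults follow the 1/12 and 1/8 fractions given in the text.
    """
    return (array_width_mm / cells_across_width,
            array_length_mm / cells_along_length)


if __name__ == "__main__":
    print(cell_dimensions_mm(12, 8))     # smallest array in the stated range
    print(cell_dimensions_mm(244, 400))  # largest array in the stated range
```

The two extreme cases give cells of about 1 mm x 1 mm and about 20 mm x 50 mm, matching the ranges quoted in the text.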

The construction of the substrate is shown 
cross-sectionally in Fig. 11, which is an enlarged sectional 
view taken along view line 124 in Fig. 9. The 
substrate includes a water-impermeable backing 126, 
such as a glass slide or rigid polymer sheet. Formed 
on the surface of the backing is a water-permeable film 
128. The film is formed of a porous membrane material, 
such as nitrocellulose membrane, or a porous web 
material, such as a nylon, polypropylene, or PVDF 
porous polymer material. The thickness of the film is 
preferably between about 10 and 1000 µm. The film may 
be applied to the backing by spraying or coating 
uncured material on the backing, or by applying a 
preformed membrane to the backing. The backing and 
film may be obtained as a preformed unit from a 
commercial source, e.g., a plastic-backed 
nitrocellulose film available from Schleicher and 
Schuell Corporation. 

With continued reference to Fig. 11, the 
film-covered surface in the substrate is partitioned into a 
desired array of cells by water-impermeable grid lines, 
such as lines 130, 132, which have infiltrated the film 
down to the level of the backing, and extend above the 
surface of the film as shown, typically a distance of 
100 to 2000 µm above the film surface. 

The grid lines are formed on the substrate by 
laying down an uncured or otherwise flowable resin or 
elastomer solution in an array grid, allowing the 
material to infiltrate the porous film down to the 
backing, then curing or otherwise hardening the grid 
lines to form the cell-array substrate. 

One preferred material for the grid is a flowable 
silicone available from Loctite Corporation. The 
barrier material can be extruded through a narrow 
syringe (e.g., 22 gauge) using air pressure or 
mechanical pressure. The syringe is moved relative to 
the solid support to print the barrier elements as a 
grid pattern. The extruded bead of silicone wicks into 
the pores of the solid support and cures to form a 
shallow waterproof barrier separating the regions of 
the solid support. 

In alternative embodiments, the barrier element 
can be a wax-based material or a thermoset material 
such as epoxy. The barrier material can also be a 
UV-curing polymer which is exposed to UV light after being 
printed onto the solid support. The barrier material 
may also be applied to the solid support using printing 
techniques such as silk-screen printing. The barrier 
material may also be a heat-seal stamping of the porous 
solid support which seals its pores and forms a 
water-impervious barrier element. The barrier material may 
also be a shallow grid which is laminated or otherwise 
adhered to the solid support. 

In addition to plastic-backed nitrocellulose, the 
solid support can be virtually any porous membrane with 
or without a non-porous backing. Such membranes are 
readily available from numerous vendors and are made 
from nylon, PVDF, polysulfone and the like. In an 
alternative embodiment, the barrier element may also be 
used to adhere the porous membrane to a non-porous 
backing in addition to functioning as a barrier to 
prevent cross contamination of the assay reagents. 

In an alternative embodiment, the solid support 
can be of a non-porous material. The barrier can be 
printed either before or after the microarray of 
biomolecules is printed on the solid support. 

As can be appreciated, the cells formed by the 
grid lines and the underlying backing are 
water-impermeable, having side barriers projecting above the 
porous film in the cells. Thus, defined-volume samples 
can be placed in each well without risk of 
cross-contamination with sample material in adjacent cells. 
In Fig. 11, defined-volume samples, such as sample 
134, are shown in the cells. 

As noted above, each well contains a microarray of 
distinct biopolymers. In one general embodiment, the 
microarrays in the wells are identical arrays of 
distinct biopolymers, e.g., different sequence 
polynucleotides. Such arrays can be formed in 
accordance with the methods described in Section II, by 
depositing a first selected polynucleotide at the same 
selected microarray position in each of the cells, then 
depositing a second polynucleotide at a different 
microarray position in each well, and so on until a 
complete, identical microarray is formed in each cell. 

In a preferred embodiment, each microarray 
contains about 10³ distinct polynucleotide or 
polypeptide biopolymers per surface area of less than 
about 1 cm². Also in a preferred embodiment, the 
biopolymers in each microarray region are present in a 
defined amount between about 0.1 femtomoles and 100 
nanomoles. The ability to form high-density arrays of 
biopolymers, where each region is formed of a 
well-defined amount of deposited material, can be achieved 
in accordance with the microarray-forming method 
described in Section II. 

Also in a preferred embodiment, the biopolymers 
are polynucleotides having lengths of at least about 50 
bp, i.e., substantially longer than oligonucleotides 
which can be formed in high-density arrays by schemes 
involving parallel, step-wise polymer synthesis on the 
array surface. 

In the case of a polynucleotide array, in an assay 
procedure, a small volume of the labeled DNA probe 
mixture in a standard hybridization solution is loaded 
onto each cell. The solution will spread to cover the 
entire microarray and stop at the barrier elements. 
The solid support is then incubated in a humid chamber 
at the appropriate temperature as required by the 
assay. 

Each assay may be conducted in an "open-face" 
format where no further sealing step is required, since 
the hybridization solution will be kept properly 
hydrated by the water vapor in the humid chamber. At 
the conclusion of the incubation step, the entire solid 
support containing the numerous microarrays is rinsed 
quickly enough to dilute the assay reagents so that no 
significant cross contamination occurs. The entire 
solid support is then reacted with detection reagents 
if needed and analyzed using standard colorimetric, 
radioactive or fluorescent detection means. All 
processing and detection steps are performed 
simultaneously on all of the microarrays on the solid 
support, ensuring uniform assay conditions for all of 
the microarrays on the solid support. 

B. Glass-Slide Polynucleotide Array 

Fig. 5 shows a substrate 136 formed according to 
another aspect of the invention, and intended for use 
in detecting binding of labeled polynucleotides to one 
or more of a plurality of distinct polynucleotides. The 
substrate includes a glass substrate 138 having formed 
on its surface a coating of a polycationic polymer, 
preferably a cationic polypeptide, such as polylysine 
or polyarginine. Formed on the polycationic coating is 
a microarray 140 of distinct polynucleotides, each 
localized at known selected array regions, such as 
regions 142. 

The slide is coated by placing a uniform-thickness 
film of a polycationic polymer, e.g., poly-l-lysine, on 
the surface of a slide and drying the film to form a 
dried coating. The amount of polycationic polymer 
added is sufficient to form at least a monolayer of 
polymers on the glass surface. The polymer film is 
bound to the surface via electrostatic binding between 
negative silyl-OH groups on the surface and charged 
amine groups in the polymers. Poly-l-lysine coated 
glass slides may be obtained commercially, e.g., from 
Sigma Chemical Co. (St. Louis, MO). 

To form the microarray, defined volumes of 
distinct polynucleotides are deposited on the 
polymer-coated slide, as described in Section II. According to 
an important feature of the substrate, the deposited 
polynucleotides remain bound to the coated slide 
surface non-covalently when an aqueous DNA sample is 
applied to the substrate under conditions which allow 
hybridization of reporter-labeled polynucleotides in 
the sample to complementary-sequence (single-stranded) 
polynucleotides in the substrate array. The method is 
illustrated in Examples 1 and 2. 

To illustrate this feature, a substrate of the 
type just described, but having an array of 
same-sequence polynucleotides, was mixed with 
fluorescent-labeled complementary DNA under hybridization 
conditions. After washing to remove non-hybridized 
material, the substrate was examined by low-power 
fluorescence microscopy. The array can be visualized 
by the relatively uniform labeling pattern of the array 
regions. 

In a preferred embodiment, each microarray 
contains at least 10³ distinct polynucleotide or 
polypeptide biopolymers per surface area of less than 
about 1 cm². In the embodiment shown in Fig. 5, the 
microarray contains 400 regions in an area of about 16 
mm², or 2.5 x 10³ regions/cm². Also in a preferred 
embodiment, the polynucleotides in each microarray 
region are present in a defined amount between about 
0.1 femtomoles and 100 nanomoles in the case of 
polynucleotides. As above, the ability to form high- 
density arrays of this type, where each region is 
formed of a well-defined amount of deposited material, 
can be achieved in accordance with the microarray- 
forming method described in Section II. 
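
The density figure quoted for the Fig. 5 embodiment can be checked with a short calculation. The sketch below is only an illustration of the unit conversion (1 cm² = 100 mm²); the function name is a hypothetical helper and is not part of the described method.

```python
def regions_per_cm2(n_regions: int, area_mm2: float) -> float:
    """Convert a region count over an area given in mm^2 to regions per cm^2."""
    # 1 cm^2 = 100 mm^2, so scale the count by 100 before dividing.
    return n_regions * 100.0 / area_mm2


# Values taken from the text: 400 regions in about 16 mm^2.
density = regions_per_cm2(400, 16.0)
print(density)  # 2500.0, i.e. 2.5 x 10^3 regions/cm^2
```

The result is above the preferred minimum of 10³ biopolymers per cm² stated in the same paragraph.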

Also in a preferred embodiment, the 
polynucleotides have lengths of at least about 50 bp, 
i.e., substantially longer than oligonucleotides which 
can be formed in high-density arrays by various in situ 
synthesis schemes. 

V. Utility 

Microarrays of immobilized nucleic acid sequences 
prepared in accordance with the invention can be used 
for large scale hybridization assays in numerous 
genetic applications, including genetic and physical 
mapping of genomes, monitoring of gene expression, DNA 
sequencing, genetic diagnosis, genotyping of organisms, 
and distribution of DNA reagents to researchers. 

For gene mapping, a gene or a cloned DNA fragment 
is hybridized to an ordered array of DNA fragments, and 
the identity of the DNA elements applied to the array 
is unambiguously established by the pixel or pattern of 
pixels of the array that are detected. One application 
of such arrays for creating a genetic map is described 
by Nelson, et al. (1993). In constructing physical 
maps of the genome, arrays of immobilized cloned DNA 
fragments are hybridized with other cloned DNA 
fragments to establish whether the cloned fragments in 
the probe mixture overlap and are therefore contiguous 
to the immobilized clones on the array. For example, 
Lehrach, et al., describe such a process. 

The arrays of immobilized DNA fragments may also 
be used for genetic diagnostics. To illustrate, an 
array containing multiple forms of a mutated gene or 
genes can be probed with a labeled mixture of a 
patient's DNA which will preferentially interact with 
only one of the immobilized versions of the gene. 
The detection of this interaction can lead to a 
medical diagnosis. Arrays of immobilized DNA fragments 
can also be used in DNA probe diagnostics. For 
example, the identity of a pathogenic microorganism can 
be established unambiguously by hybridizing a sample of 
the unknown pathogen's DNA to an array containing many 
types of known pathogenic DNA. A similar technique can 
also be used for unambiguous genotyping of any 
organism. Other molecules of genetic interest, such as 
cDNAs and RNAs, can be immobilized on the array or 
alternately used as the labeled probe mixture that is 
applied to the array. 

In one application, an array of cDNA clones 
representing genes is hybridized with total cDNA from 
an organism to monitor gene expression for research or 
diagnostic purposes. Labeling total cDNA from a normal 
cell with one color fluorophore and total cDNA from a 
diseased cell with another color fluorophore and 
simultaneously hybridizing the two cDNA samples to the 
same array of cDNA clones allows for differential gene 
expression to be measured as the ratio of the two 
fluorophore intensities. This two-color experiment can 
be used to monitor gene expression in different tissue 
types, disease states, response to drugs, or response 
to environmental factors. An example of this approach 
is illustrated in Example 2, described with respect to 
Fig. 8. 
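
The ratio measurement described above can be sketched in a few lines. This is a minimal illustration and not the analysis actually used: the intensity floor, the two-fold cutoff, and the function names are assumptions introduced here, and the "red", "green", and "yellow" labels simply follow the color convention of Example 2.

```python
import math


def log2_ratio(red: float, green: float, floor: float = 1.0) -> float:
    """Log2 of the red/green intensity ratio for one array element.

    `floor` (an assumed value) keeps near-zero intensities from
    producing a division by zero or an extreme ratio.
    """
    red = max(red, floor)
    green = max(green, floor)
    return math.log2(red / green)


def classify_spot(red: float, green: float, fold: float = 2.0) -> str:
    """Label a spot by which labeled sample dominates (assumed 2-fold cutoff)."""
    ratio = log2_ratio(red, green)
    cutoff = math.log2(fold)
    if ratio >= cutoff:
        return "red"      # gene more abundant in the red-labeled sample
    if ratio <= -cutoff:
        return "green"    # gene more abundant in the green-labeled sample
    return "yellow"       # roughly equal contribution from both samples


print(log2_ratio(800.0, 100.0))     # 3.0
print(classify_spot(800.0, 100.0))  # red
print(classify_spot(500.0, 500.0))  # yellow
```

Working with the logarithm of the ratio treats a two-fold increase and a two-fold decrease symmetrically, which is why the cutoff is applied on both sides of zero.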

By way of example and without implying a 
limitation of scope, such a procedure could be used to 
simultaneously screen many patients against all known 
mutations in a disease gene. This invention could be 
used in the form of, for example, 96 identical 0.9 cm x 
2.2 cm microarrays fabricated on a single 12 cm x 18 cm 
sheet of plastic-backed nitrocellulose where each 
microarray could contain, for example, 100 DNA 
fragments representing all known mutations of a given 
gene. The region of interest from each of the DNA 
samples from 96 patients could be amplified, labeled, 
and hybridized to the 96 individual arrays with each 
assay performed in 100 microliters of hybridization 
solution. The approximately 1 thick silicone rubber 
barrier elements between individual arrays prevent 
cross contamination of the patient samples by sealing 
the pores of the nitrocellulose and by acting as a 
physical barrier between each microarray. The solid 
support containing all 96 microarrays assayed with the 
96 patient samples is incubated, rinsed, detected and 
analyzed as a single sheet of material using standard 
radioactive, fluorescent, or colorimetric detection 
means (Maniatis, et al., 1989). Previously, such a 
procedure would involve the handling, processing and 
tracking of 96 separate membranes in 96 separate sealed 
chambers. By processing all 96 arrays as a single 
sheet of material, significant time and cost savings 
are possible. 

The assay format can be reversed where the patient 
or organism's DNA is immobilized as the array elements 
and each array is hybridized with a different mutated 
allele or genetic marker. The gridded solid support 
can also be used for parallel non-DNA ELISA assays. 
Furthermore, the invention allows for the use of all 
standard detection methods without the need to remove 
the shallow barrier elements to carry out the detection 
step. 

In addition to the genetic applications listed 
above, arrays of whole cells, peptides, enzymes, 
antibodies, antigens, receptors, ligands, 
phospholipids, polymers, drug congener preparations or 
chemical substances can be fabricated by the means 
described in this invention for large scale screening 
assays in medical diagnostics, drug discovery, 
molecular biology, immunology and toxicology. 

The multi-cell substrate aspect of the invention 
allows for the rapid and convenient screening of many 
DNA probes against many ordered arrays of DNA 
fragments. This eliminates the need to handle and 
detect many individual arrays for performing mass 
screenings for genetic research and diagnostic 
applications. Numerous microarrays can be fabricated 
on the same solid support and each microarray reacted 
with a different DNA probe while the solid support is 
processed as a single sheet of material. 

The following examples illustrate, but in no way 
are intended to limit, the present invention. 

Example 1 

Genomic-Complexity Hybridization to Micro 
DNA Arrays Representing the Yeast 
Saccharomyces cerevisiae Genome with 
Two-Color Fluorescent Detection 

The array elements were randomly amplified PCR 
(Bohlander, et al., 1992) products using physically 
mapped lambda clones of S. cerevisiae genomic DNA 
templates (Riles, et al., 1993). The PCR was performed 
directly on the lambda phage lysates resulting in an 
amplification of both the 35 kb lambda vector and the 
5-15 kb yeast insert sequences in the form of a uniform 
distribution of PCR product between 250-1500 base pairs 
in length. The PCR product was purified using 
Sephadex G50 gel filtration (Pharmacia, Piscataway, NJ) 
and concentrated by evaporation to dryness at room 
temperature overnight. Each of the 864 amplified 
lambda clones was rehydrated in 15 µl of 3 x SSC in 
preparation for spotting onto the glass. 

The micro arrays were fabricated on microscope 
slides which were coated with a layer of poly-l-lysine 
(Sigma). The automated apparatus described in Section 
IV loaded 1 µl of the concentrated lambda clone PCR 
product in 3 x SSC directly from 96 well storage plates 
into the open capillary printing element and deposited 
~5 nl of sample per slide at 380 micron spacing between 
spots, on each of 40 slides. The process was repeated 
for all 864 samples and 8 control spots. After the 
spotting operation was complete, the slides were 
rehydrated in a humid chamber for 2 hours, baked in a 
dry 80° vacuum oven for 2 hours, rinsed to remove un- 
absorbed DNA and then treated with succinic anhydride 
to reduce non-specific adsorption of the labeled 
hybridization probe to the poly-l-lysine coated glass 
surface. Immediately prior to use, the immobilized DNA 
on the array was denatured in distilled water at 90° 
for 2 minutes. 
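
The printing figures in this paragraph imply a simple volume budget for each loaded sample. The sketch below only restates that arithmetic; the constant and function names are hypothetical, and the 5 nl deposit is the approximate value given in the text.

```python
LOAD_VOLUME_NL = 1000.0   # 1 microliter loaded into the capillary, in nanoliters
DEPOSIT_NL = 5.0          # approximate volume deposited per slide
N_SLIDES = 40             # slides printed per load
N_SAMPLES = 864           # amplified lambda clones
N_CONTROLS = 8            # control spots


def volume_dispensed_nl(deposit_nl: float, n_slides: int) -> float:
    """Total volume deposited from one load across all slides."""
    return deposit_nl * n_slides


dispensed = volume_dispensed_nl(DEPOSIT_NL, N_SLIDES)
fraction_used = dispensed / LOAD_VOLUME_NL
spots_per_slide = N_SAMPLES + N_CONTROLS
total_depositions = spots_per_slide * N_SLIDES

print(dispensed)          # 200.0 nl deposited per load
print(fraction_used)      # 0.2 of the loaded volume
print(spots_per_slide)    # 872 spots on each slide
print(total_depositions)  # 34880 tapping events for the full run
```

On these numbers, roughly a fifth of each loaded microliter ends up on the slides, so one load is comfortably enough for all 40 slides.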

For the pooled chromosome experiment, the 16 
chromosomes of Saccharomyces cerevisiae were separated 
in a CHEF agarose gel apparatus (Biorad, Richmond, CA). 
The six largest chromosomes were isolated in one gel 
slice and the smallest 10 chromosomes in a second gel 
slice. The DNA was recovered using a gel extraction 
kit (Qiagen, Chatsworth, CA). The two chromosome pools 
were randomly amplified in a manner similar to that 
used for the target lambda clones. Following 
amplification, 5 micrograms of each of the amplified 
chromosome pools were separately random-primer labeled 
using Klenow polymerase (Amersham, Arlington Heights, 
IL) with a lissamine conjugated nucleotide analog 
(Dupont NEN, Boston, MA) for the pool containing the 
six largest chromosomes, and with a fluorescein 
conjugated nucleotide analog (BMB) for the pool 
containing the smallest ten chromosomes. The two pools 
were mixed and concentrated using an ultrafiltration 
device (Amicon, Danvers, MA). 

Five micrograms of the hybridization probe 
consisting of both chromosome pools in 7.5 µl of TE was 
denatured in a boiling water bath and then snap cooled 
on ice. 2.5 µl of concentrated hybridization solution 
(5 x SSC and 0.1% SDS) was added and all 10 µl 
transferred to the array surface, covered with a cover 
slip, placed in a custom-built single-slide humidity 
chamber and incubated at 60° for 12 hours. The slides 
were then rinsed at room temperature in 0.1 x SSC and 
0.1% SDS for 5 minutes, cover slipped and scanned. 

A custom built laser fluorescent scanner was used 
to detect the two-color hybridization signals from the 
1.8 x 1.8 cm array at 20 micron resolution. The 
scanned image was gridded and analyzed using custom 
image analysis software. After correcting for optical 
crosstalk between the fluorophores due to their 
overlapping emission spectra, the red and green 
hybridization values for each clone on the array were 
correlated to the known physical map position of the 
clone resulting in a computer-generated color karyotype 
of the yeast genome. 
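
The text does not say how the crosstalk correction was carried out. One common approach, shown here purely as an assumed sketch, is linear unmixing: each observed channel is modeled as the true signal of its own dye plus a fixed fraction of the other dye's signal, and the two equations are solved for the true values. The coefficient names and the example values are hypothetical.

```python
def correct_crosstalk(obs_red: float, obs_green: float,
                      green_into_red: float, red_into_green: float):
    """Solve a two-channel linear crosstalk model for the true signals.

    Assumed model (not taken from the source):
        obs_red   = true_red   + green_into_red * true_green
        obs_green = true_green + red_into_green * true_red
    """
    det = 1.0 - green_into_red * red_into_green
    if det <= 0.0:
        raise ValueError("crosstalk coefficients too large to invert")
    true_red = (obs_red - green_into_red * obs_green) / det
    true_green = (obs_green - red_into_green * obs_red) / det
    return true_red, true_green


# Hypothetical spot: true signals 100 (red) and 200 (green), with 10% of
# green leaking into the red channel and 5% of red into the green channel.
red, green = correct_crosstalk(120.0, 205.0, 0.10, 0.05)
print(round(red, 6), round(green, 6))  # 100.0 200.0
```

In practice the two leakage fractions would have to be measured, for example from spots hybridized with only one of the two labeled pools.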

Figure 6 shows the hybridization pattern of the 
two chromosome pools. A red signal indicates that the 
lambda clone on the array surface contains a cloned 
genomic DNA segment from one of the largest six yeast 
chromosomes. A green signal indicates that the lambda 
clone insert comes from one of the smallest ten yeast 
chromosomes. Orange signals indicate repetitive 
sequences which cross hybridized to both chromosome 
pools. Control spots on the array confirm that the 
hybridization is specific and reproducible. 

The physical map locations of the genomic DNA 
fragments contained in each of the clones used as array 
elements have been previously determined by Olson and 
co-workers (Riles, et al.) allowing for the automatic 
generation of the color karyotype shown in Figure 7. 
The color of a chromosomal section on the karyotype 
corresponds to the color of the array element 
containing the clone from that section. The black 
regions of the karyotype represent false negative dark 
spots on the array (10%) or regions of the genome not 
covered by the Olson clone library (90%). Note that 
the largest six chromosomes are mainly red while the 
smallest ten chromosomes are mainly green, matching the 
original CHEF gel isolation of the hybridization probe. 
Areas of the red chromosomes containing green spots and 
vice-versa are probably due to spurious sample tracking 
errors in the formation of the original library and in 
the amplification and spotting procedures. 
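
The step from spot colors to a karyotype amounts to a lookup of each clone's map position. The sketch below is an illustration only: the clone names, chromosome labels, and positions are invented placeholders, and the data layout is an assumption rather than a description of the custom software mentioned above.

```python
from collections import defaultdict


def build_karyotype(clone_positions, spot_colors):
    """Order spot colors along each chromosome by physical map position.

    clone_positions: {clone_id: (chromosome, start_kb)}
    spot_colors:     {clone_id: color}; clones with no called signal are
                     drawn black, as in the karyotype described above.
    """
    by_chromosome = defaultdict(list)
    for clone_id, (chromosome, start_kb) in clone_positions.items():
        color = spot_colors.get(clone_id, "black")
        by_chromosome[chromosome].append((start_kb, color))
    # Sort each chromosome's segments by position and keep only the colors.
    return {chrom: [color for _, color in sorted(segments)]
            for chrom, segments in by_chromosome.items()}


# Invented placeholder data, not taken from the clone library.
positions = {"clone_a": ("IV", 0), "clone_b": ("IV", 20), "clone_c": ("I", 0)}
colors = {"clone_a": "red", "clone_c": "green"}
print(build_karyotype(positions, colors))
# {'IV': ['red', 'black'], 'I': ['green']}
```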

The yeast genome arrays have also been probed with 
individual clones or pools of clones that are 
fluorescently labeled for physical mapping purposes. 
The hybridization signals of these clones to the array 
were translated into a position on the physical map of 
yeast. 

Example 2 

Total cDNA Hybridized to Micro Arrays of 
cDNA Clones with Two-Color 
Fluorescent Detection 

24 clones containing cDNA inserts from the plant 
Arabidopsis were amplified using PCR. Salt was added 
to the purified PCR products to a final concentration 
of 3 x SSC. The cDNA clones were spotted on poly-l-lysine 
coated microscope slides in a manner similar to 
Example 1. Among the cDNA clones was a clone 
representing a transcription factor HAT4, which had 
previously been used to create a transgenic line of the 
plant Arabidopsis, in which this gene is present at ten 
times the level found in wild-type Arabidopsis (Schena, 
et al., 1992). 

Total poly-A mRNA from wild type Arabidopsis was 
isolated using standard methods (Maniatis, et al., 
1989) and reverse transcribed into total cDNA, using 
fluorescein nucleotide analog to label the cDNA product 
(green fluorescence). A similar procedure was 
performed with the transgenic line of Arabidopsis where 
the transcription factor HAT4 was inserted into the 
genome using standard gene transfer protocols. cDNA 
copies of mRNA from the transgenic plant are labeled 
with a lissamine nucleotide analog (red fluorescence). 
Two micrograms of the cDNA products from each type of 
plant were pooled together and hybridized to the cDNA 
clone array in a 10 microliter hybridization reaction 
in a manner similar to Example 1. Rinsing and 
detection of hybridization was also performed in a 
manner similar to Example 1. Fig. 8 shows the resulting 
hybridization pattern of the array. 

Genes equally expressed in wild type and the 
transgenic Arabidopsis appeared yellow due to equal 
contributions of the green and red fluorescence to the 
final signal. The dots are different intensities of 
yellow indicating various levels of gene expression. 
The cDNA clone representing the transcription factor 
HAT4, expressed in the transgenic line of Arabidopsis 
but not detectably expressed in wild type Arabidopsis, 
appears as a red dot (with the arrow pointing to it), 
indicating the preferential expression of the 
transcription factor in the red-labeled transgenic 
Arabidopsis and the relative lack of expression of the 
transcription factor in the green-labeled wild type 
Arabidopsis. 

An advantage of the microarray hybridization 
format for gene expression studies is the high partial 
concentration of each cDNA species achievable in the 10 
microliter hybridization reaction. This high partial 
concentration allows for detection of rare transcripts 
without the need for PCR amplification of the 
hybridization probe which may bias the true genetic 
representation of each discrete cDNA species. 

Gene expression studies such as these can be used 
for genomics research to discover which genes are 
expressed in which cell types, disease states, 
development states or environmental conditions. Gene 
expression studies can also be used for diagnosis of 
disease by empirically correlating gene expression 
patterns to disease states. 

Example 3 

Multiplexed Colorimetric Hybridization on 
a Gridded Solid Support 

A sheet of plastic-backed nitrocellulose was 
gridded with barrier elements made from silicone rubber 
according to the description in Section IV-A. The 
sheet was soaked in 10 x SSC and allowed to dry. As 
shown in Fig. 12, 192 M13 clones, each with a different 
yeast insert, were arrayed 400 microns apart in four 
quadrants of the solid support using the automated 
device described in Section III. The bottom left 
quadrant served as a negative control for hybridization 
while each of the other three quadrants was hybridized 
simultaneously with a different oligonucleotide using 
the open-face hybridization technology described in 
Section IV-A. The first two and last four elements of 
each array are positive controls for the colorimetric 
detection step. 

The oligonucleotides were labeled with fluorescein 
which was detected using an anti-fluorescein antibody 
conjugated to alkaline phosphatase that precipitated an 
NBT/BCIP dye on the solid support (Amersham). Perfect 
matches between the labeled oligos and the M13 clones 
resulted in dark spots visible to the naked eye and 
detected using an optical scanner (HP ScanJet II) 
attached to a personal computer. The hybridization 
patterns are different in every quadrant, indicating 
that each oligo found several unique M13 clones from 
among the 192 with a perfect sequence match. Note that 
the open capillary printing tip leaves detectable 
dimples on the nitrocellulose which can be used to 
automatically align and analyze the images. 

Although the invention has been described with 
respect to specific embodiments and methods, it will be 
clear that various changes and modifications may be made 
without departing from the invention. 

IT IS CLAIMED: 

1. A method of forming a microarray of analyte- 
assay regions on a solid support, where each region in 
the array has a known amount of a selected, analyte- 
specific reagent, said method comprising, 

(a) loading a solution of a selected analyte- 
specific reagent in a reagent-dispensing device having 
an elongate capillary channel (i) formed by spaced- 
apart, coextensive elongate members, (ii) adapted to 
hold a quantity of the reagent solution and (iii) 
having a tip region at which aqueous solution in the 
channel forms a meniscus, 

(b) tapping the tip of the dispensing device 
against a solid support at a defined position on the 
surface, with an impulse effective to break the 
meniscus in the capillary channel and deposit a 
selected volume of solution on the surface, and 

(c) repeating steps (a) and (b) until said array 
is formed. 

2. The method of claim 1, wherein said tapping is 
carried out with an impulse effective to deposit a 
selected volume in the volume range between 0.01 to 100 
nl. 

3. The method of claim 1, wherein said channel is 
formed by a pair of spaced-apart tapered elements. 

4. The method of claim 1, for forming a plurality 
of such arrays, wherein step (b) is applied to a 
selected position on each of a plurality of solid 
supports at each repeat cycle preceding step (c). 

5. The method of claim 1, which further includes, 
after performing steps (a) and (b) at least one time, 
reloading the reagent-dispensing device with a new 
reagent solution by the steps of (i) dipping the 
capillary channel of the device in a wash solution, 
(ii) removing wash solution drawn into the capillary 
channel, and (iii) dipping the capillary channel into 
the new reagent solution. 

6. Automated apparatus for forming a microarray 
of analyte-assay regions on a plurality of solid 
supports, where each region in the array has a known 
amount of a selected, analyte-specific reagent, said 
apparatus comprising 

(a) a holder for holding, at known positions, a 
plurality of planar supports, 

(b) a reagent dispensing device having an open 
capillary channel (i) formed by spaced-apart, 
coextensive elongate members (ii) adapted to hold a 
quantity of the reagent solution and (iii) having a tip 
region at which aqueous solution in the channel forms a 
meniscus, 

(c) positioning means for positioning the 
dispensing device at a selected array position with 
respect to a support in said holder, 

(d) dispensing means for moving the device into 
tapping engagement against a support with a selected 
impulse, when the device is positioned at a defined 
array position with respect to that support, with an 
impulse effective to break the meniscus of liquid in 
the capillary channel and deposit a selected volume of 
solution on the surface, and 

(e) control means for controlling said positioning 
and dispensing means. 

7. The apparatus of claim 6, wherein said 
dispensing means is effective to move said dispensing 
device against a support with an impulse effective to 
deposit a selected volume in the volume range between 
0.01 to 100 nl. 

8. The apparatus of claim 6, wherein said channel 
is formed by a pair of spaced-apart tapered elements. 

9. The apparatus of claim 6, wherein the control 
means operates to (i) place the dispensing device at a 
loading station, (ii) move the capillary channel in the 
device into a selected reagent at the loading station, 
to load the dispensing device with the reagent, and 
(iii) dispense the reagent at a defined array position 
on each of the supports on said holder. 

10. The apparatus of claim 6, wherein the control 
device further operates, at the end of a dispensing 
cycle, to wash the dispensing device by (i) placing the 
dispensing device at a washing station, (ii) moving the 
capillary channel in the device into a wash fluid, to 
load the dispensing device with the fluid, and (iii) 
remove the wash fluid prior to loading the dispensing 
device with a fresh selected reagent. 

11. The apparatus of claim 6, wherein said device 
is one of a plurality of such devices which are carried 
on the arm for dispensing different analyte assay 
reagents at selected spaced array positions. 

12. A substrate with a surface having a 
microarray of at least 10³ distinct polynucleotide or 
polypeptide biopolymers per 1 cm² surface area, each 
distinct biopolymer sample (i) being disposed at a 
separate, defined position in said array, (ii) having a 
length of at least 50 subunits, and (iii) being present 
in a defined amount between about 0.1 femtomole and 100 
nanomoles. 

13. The substrate of claim 12, wherein said 
surface is a glass slide coated with polylysine, and said 
biopolymers are polynucleotides. 

14. The substrate of claim 12, wherein said 
substrate has a water-impermeable backing, a water- 
permeable film formed on the backing, and a grid formed 
on the film, where said grid (i) is composed of 
intersecting water-impervious grid elements extending 
from said backing to positions raised above the surface 
of said film, and (ii) partitions the film into a 
plurality of water-impervious cells, where each cell 
contains such a biopolymer array. 

15. A substrate with a surface array of sample- 
receiving cells, comprising 

a water-impermeable backing, 

a water-permeable film formed on the backing, and 

a grid formed on the film, said grid being composed of 
intersecting water-impervious grid elements extending 
from said backing to positions raised above the surface 
of said film. 

16. The substrate of claim 15, wherein the cells 
of the array each contain an array of biopolymers. 

17. A substrate for use in detecting binding of 
labeled biopolymers to one or more of a plurality of 
distinct polynucleotides, comprising 

a non-porous, glass substrate, 

a coating of a cationic polymer on said substrate, 
and 

an array of distinct polynucleotides to said 
coating, where each biopolymer is disposed at a 
separate, defined position in a surface array of 
biopolymers. 

18. A method of detecting differential expression 
of each of a plurality of genes in a first cell type 
with respect to expression of the same genes in a 
second cell type, said method comprising 

producing fluorescence-labeled cDNA's from mRNA's 
isolated from the two cell types, where the cDNA's 
from the first and second cells are labeled with first 
and second different fluorescent reporters, 

adding a mixture of the labeled cDNA's from the 
two cell types to an array of polynucleotides 
representing a plurality of known genes derived from 
the two cell types, under conditions that result in 
hybridization of the cDNA's to complementary-sequence 
polynucleotides in the array; and 

examining the array by fluorescence under 
fluorescence excitation conditions in which (i) 
polynucleotides in the array that are hybridized 
predominantly to cDNA's derived from one of the first 
and second cell types give a distinct first or second 
fluorescence emission color, respectively, and (ii) 
polynucleotides in the array that are hybridized to 
substantially equal numbers of cDNA's derived from the 
first and second cell types give a distinct combined 
fluorescence emission color, respectively, 

wherein the relative expression of known genes in 
the two cell types can be determined by the observed 
fluorescence emission color of each spot. 

19. The method of claim 18, wherein the array of 
polynucleotides is formed on a substrate with a surface 
having an array of at least 10² distinct polynucleotide 
or polypeptide biopolymers in a surface area of less 
than about 1 cm², each distinct biopolymer (i) being 
disposed at a separate, defined position in said array, 
(ii) having a length of at least 50 subunits, and (iii) 
being present in a defined amount between about 0.1 
femtomole and 100 nmoles. 

20. The method of claim 19, wherein said surface 
is a glass slide coated with polylysine, and said 
biopolymers are polynucleotides non-covalently bound to 
said polylysine. 

ABSTRACT cDNA microarray technology is used to profile 
complex diseases and discover novel disease-related genes. In 
inflammatory disease such as rheumatoid arthritis, expression 
patterns of diverse cell types contribute to the pathology. We 
have monitored gene expression in this disease state with a 
microarray of selected human genes of probable significance in 
inflammation as well as with genes expressed in peripheral 
human blood cells. Messenger RNA from cultured macrophages, 
chondrocyte cell lines, primary chondrocytes, and synoviocytes 
provided expression profiles for the selected cytokines, chemo- 
kines, DNA binding proteins, and matrix-degrading metal- 
loproteinases. Comparisons between tissue samples of rheuma- 
toid arthritis and inflammatory bowel disease verified the in- 
volvement of many genes and revealed novel participation of the 
cytokine interleukin 3, chemokine Groα and the metalloproteinase 
matrix metallo-elastase in both diseases. From the 
peripheral blood library, tissue inhibitor of metalloproteinase 1, 
ferritin light chain, and manganese superoxide dismutase genes 
were identified as expressed differentially in rheumatoid arthri- 
tis compared with inflammatory bowel disease. These results 
successfully demonstrate the use of the cDNA microarray system 
as a general approach for dissecting human diseases. 



The recently described cDNA microarray or DNA-chip tech- 
nology allows expression monitoring of hundreds and thou- 
sands of genes simultaneously and provides a format for 
identifying genes as well as changes in their activity (1, 2). 
Using this technology, two-color fluorescence patterns of 
differential gene expression in the root versus the shoot tissue 
of Arabidopsis were obtained in a specific array of 48 genes (1). 
In another study using a 1000 gene array from a human 
peripheral blood library, novel genes expressed by T cells were 
identified upon heat shock and protein kinase C activation (3). 

The technology uses cDNA sequences or cDNA inserts of a 
library for PCR amplification that are arrayed on a glass slide with 
high speed robotics at a density of 1000 cDNA sequences per cm². 
These microarrays serve as gene targets for hybridization to 
cDNA probes prepared from RNA samples of cells or tissues. A 
two-color fluorescence labeling technique is used in the prepa- 
ration of the cDNA probes such that a simultaneous hybridization 
but separate detection of signals provides the comparative anal- 
ysis and the relative abundance of specific genes expressed (1, 2). 
Microarrays can be constructed from specific cDNA clones of 
interest, a cDNA library, or a select number of open reading 
frames from a genome sequencing database to allow a large-scale 
functional analysis of expressed sequences. 

Because of the wide spectrum of genes and endogenous 
mediators involved, the microarray technology is well suited 
for analyzing chronic diseases. In rheumatoid arthritis (RA), 
inflammation of the joint is caused by the gene products of 
many different cell types present in the synovium and cartilage 
tissues plus those infiltrating from the circulating blood. The 
autoimmune and inflammatory nature of the disease is a 
cumulative result of genetic susceptibility factors and multiple 
responses, paracrine and autocrine in nature, from macro- 
phages, T cells, plasma cells, neutrophils, synovial fibroblasts, 
chondrocytes, etc. Growth factors, inflammatory cytokines 
(4), and the chemokines (5) are the important mediators of this 
inflammatory process. The ensuing destruction of the cartilage 
and bone by the invading synovial tissue includes the actions 
of prostaglandins and leukotrienes (6), and the matrix-degrading 
metalloproteinases (MMPs). The MMPs are an important 
class of Zn-dependent metallo-endoproteinases that can col- 
lectively degrade the proteoglycan and collagen components of 
the connective tissue matrix (7). 

This paper presents a study in which the involvement of 
select classes of molecules in RA was examined. Also inves- 
tigated were 1000 human genes randomly selected from a 
peripheral human blood cell library. Their differential and 
quantitative expression analysis in cells of the joint tissue, in 
diseased RA tissue and in inflammatory bowel disease (IBD) 
tissues was conducted to demonstrate the utility of the mi- 
croarray method to analyze complex diseases by their pattern 
of gene expression. Such a survey provides insight not only into 
the underlying cause of the pathology, but also provides the 
opportunity to selectively target genes for disease intervention 
by appropriate drug development and gene therapies. 

METHODS 

Microarray Design, Development, and Preparation. Two ap- 
proaches for the fabrication of cDNA microarrays were used in 
this study. In the first approach, known human genes of probable 
significance in RA were identified. Regions of the clones, pref- 
erably 1 kb in length, were selected by their proximity to the 3' end 
of the cDNA and for areas of least identity to related and 
repetitive sequences. Primers were synthesized to amplify the 
target regions by standard PCR protocols (3). Products were 



Abbreviations: RA, rheumatoid arthritis; MMP, matrix-degrading 
metalloproteinase; IBD, inflammatory bowel disease; LPS, 
lipopolysaccharide; PMA, phorbol 12-myristate 13-acetate; TNF-α, tumor 
necrosis factor α; IL, interleukin; TGF-β, transforming growth factor 
β; GCSF, granulocyte colony-stimulating factor; MIP, macrophage 
inflammatory protein; MIF, migration inhibitory factor; HME, human 
matrix metallo-elastase; RANTES, regulated upon activation, normal 
T cell expressed and secreted; Gel, gelatinase; VCAM, vascular cell 
adhesion molecule; ICE, IL-1 converting enzyme; PUMP, putative 
metalloproteinase; MnSOD, manganese superoxide dismutase; TIMP, 
tissue inhibitor of metalloproteinase; MCP, macrophage chemotactic 
protein. 

verified by gel electrophoresis and purified with Qiaquick 96-well 
purification kit (Qiagen, Chatsworth, CA), lyophilized (Savant), 
and resuspended in 5 µl of 3x standard saline citrate (SSC) buffer 
for arraying. In the second approach, the microarray containing 
the 1056 human genes from the peripheral blood lymphocyte 
library was prepared as described (3). 

Tissue Specimens. Rheumatoid synovial tissue was obtained 
from patients with late stage classic RA undergoing remedial 
synovectomy or arthroplasty of the knee. Synovial tissue was 
separated from any associated connective tissue or fat. One 
gram of each synovial specimen was subjected to RNA extrac- 
tion within 40 min of surgical excision, or explants were 
cultured in serum-free medium to examine any changes under 
in vitro conditions. For IBD, specimens of macroscopically 
inflamed lower intestinal mucosa were obtained from patients 
with Crohn disease undergoing remedial surgery. The hyper- 
trophied mucosal tissue was separated from underlying con- 
nective tissue and extracted for RNA. 

Cultured Cells. The Mono Mac-6 (MM6) monocytic cells 
(8) were grown in RPMI medium. Human chondrosarcoma 
SW1353 cells, primary human chondrocytes, and synoviocytes 
(9, 10) were cultured in DMEM; all culture media were 
supplemented with 10% fetal bovine serum, 100 μg/ml strep-
tomycin, and 500 units/ml penicillin. Treatment of cells with
lipopolysaccharide (LPS) endotoxin at 30 ng/ml, phorbol
12-myristate 13-acetate (PMA) at 50 ng/ml, tumor necrosis
factor α (TNF-α) at 50 ng/ml, interleukin (IL)-1β at 30 ng/ml,
or transforming growth factor-β (TGF-β) at 100 ng/ml is
described in the figure legends. 



Fluorescent Probe, Hybridization, and Scanning. Isolation of 
mRNA, probe preparation, and quantitation with Arabidopsis 
control mRNAs was essentially as described (3) except for the 
following minor modification. Following the reverse transcriptase 
step, the appropriate Cy3- and Cy5-labeled samples were pooled; 
mRNA was degraded by heating the sample to 65°C for 10 min with
the addition of 5 μl of 0.5 M NaOH plus 0.5 ml of 10 mM EDTA.
The pooled cDNA was purified from unincorporated nucleotides 
by gel filtration in Centri-spin columns (Princeton Separations, 
Adelphia, NJ). Samples were lyophilized and dissolved in 6 μl of
hybridization buffer (5× SSC plus 0.2% SDS). Hybridizations,
washes, scanning, quantitation procedures, and pseudocolor rep- 
resentations of fluorescent images have been described (3). Scans 
for the two fluorescent probes were normalized either to the 
fluorescence intensity of Arabidopsis mRNAs spiked into the 
labeling reactions (see Figs. 2-4) or to the signal intensity of 
β-actin and glyceraldehyde-3-phosphate dehydrogenase
(GAPDH; see Fig. 5). 
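
The normalization step described above can be illustrated with a minimal sketch. It assumes that each scan yields one intensity per array element and that the two channels are brought onto a common scale with a single factor derived from the spiked-in control elements. The use of the median control ratio, the function names, and the data layout are assumptions made for illustration; the text does not specify how the normalization was computed.

```python
from statistics import median


def normalization_factor(cy3_controls, cy5_controls):
    """Scale factor that brings Cy5 control signals onto the Cy3 scale.

    Assumption: the median of the per-control Cy3/Cy5 ratios is used.
    """
    ratios = [c3 / c5 for c3, c5 in zip(cy3_controls, cy5_controls) if c5 > 0]
    if not ratios:
        raise ValueError("no usable control elements")
    return median(ratios)


def normalized_ratios(cy3, cy5, factor):
    """Return the Cy5/Cy3 ratio per element after scaling the Cy5 channel."""
    return {
        name: (cy5[name] * factor) / cy3[name]
        for name in cy3
        if cy3[name] > 0
    }


# Hypothetical intensities for three control elements and two genes.
factor = normalization_factor([100.0, 200.0, 400.0], [50.0, 100.0, 200.0])
ratios = normalized_ratios(
    {"TNF": 100.0, "GAPDH": 300.0},
    {"TNF": 500.0, "GAPDH": 150.0},
    factor,
)
```

With these hypothetical values the Cy5 channel is scaled by a factor of 2, so an element that is unchanged between the two samples reports a ratio of 1.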

RESULTS 

Ninety-Six-Gene Microarray Design. The actions of cytokines, 
growth factors, chemokines, transcription factors, MMPs, pros- 
taglandins, and leukotrienes are well recognized in inflammatory 
disease, particularly RA (11-14). Fig. 1 displays the selected genes
for this study and also includes control cDNAs of housekeeping
genes such as β-actin and GAPDH and genes from Arabidopsis
for signal normalization and quantitation (row A, columns 1-12). 

Defining Microarray Assay Conditions. Different lengths and 
concentrations of target DNA were tested by arraying PCR-
amplified products ranging from 0.2 to 1.2 kb at concentrations
of 1 μg/μl or less. No significant difference in the signal levels was
observed within this range of target size and only with 0.2-kb
length was a signal reduced upon an 8-fold dilution of the 1 μg/μl
sample (data not shown). In this study the average length of the
targets was 1 kb, with a few exceptions in the range of ≈300 bp,
arrayed at a concentration of 1 μg/μl. Normally one PCR pro-
vided sufficient material to fabricate up to 1000 microarray targets.

Fig. 1. Ninety-six-element microarray design. The target element name and the corresponding gene are shown in the layout. Some genes have
more than one target element to guarantee specificity of signal. For TNF the targets represent decreasing lengths of 1, 0.8, 0.6, 0.4, and 0.2 kb from

In considering positional effects in the development of the 
targets for the microarrays, selection was biased toward the 3' 
proximal regions, because the signal was reduced if the target 
fragment was biased toward the 5' end (data not shown). This
result was anticipated since the hybridizing probe is prepared by 
reverse transcription with oligo(dT)-primed mRNA and is richer 
in 3' proximal sequences. Cross-hybridizations of probes to 
targets of a gene family were analyzed with the matrix metal- 



loproteinases as the example because they can show regions of 
sequence identities of greater than 70%. With collagenase-1 
(Col-1) and collagenase-2 (Col-2) genes as targets with up to 70% 
sequence identity, and stromelysin-1 (Strom-1) and stromelysin-2 
(Strom-2) genes with different degrees of identity, our results 
showed that a short region of overlap, even with 70-90% se- 
quence identity, produced a low level of cross-hybridization. 
However, shorter regions of identity spread over the length of the 
target resulted in cross-hybridization (data not shown). For 
closely related genes, targets were designed by avoiding long 
stretches of homology. For members of a gene family two or more 
target regions were included to discriminate between specificity 
of signal versus cross-hybridization. 

Fig. 2. Time course for LPS/PMA-induced MM6 cells. Array elements are described in Fig. 1. (A) Pseudocolor representations of fluorescent
scans correspond to gene expression levels at each time point. The array is made up of 8 Arabidopsis control targets and 86 human cDNA targets,
the majority of which are genes with known or suspected involvement in inflammation. The color bars provide a comparative calibration scale
between arrays and are derived from the Arabidopsis mRNA samples that are introduced in equal amounts during probe preparation. Fluorescent
probes were made by labeling mRNA from untreated MM6 cells or LPS and PMA treated cells. mRNA was isolated at indicated times after
induction. (B I-III) The two-color samples were cohybridized, and microarray scans provided the data for the levels of select transcripts at different
time points relative to abundance at time zero. The analysis was performed using normalized data collected from 8-bit images.

Monitoring Differential Expression in Cultured Cell Lines. In
RA tissue, the monocyte/macrophage population plays a prom-
inent role in phagocytic and immunomodulatory activities. Typ-
ically these cells, when triggered by an immunogen, produce the
proinflammatory cytokines TNF and IL-1. We have used the
monocyte cell line MM6 and monitored changes in gene expres- 
sion upon activation with LPS endotoxin, a component of Gram- 
negative bacterial membranes, and PMA, which augments the 
action of LPS on TNF production (15). RNA was isolated at 
different times after induction and used for cDNA probe prep- 
aration. From this time course it was clear that TNF expression 
was induced within 15 min of treatment, reached maximum levels 
in 1 hr, remained high until 4 hr and subsequently declined (Fig. 
2A). Many other cytokine genes were also transiently activated, 
such as IL-1α and -β, IL-6, and granulocyte colony-stimulating
factor (GCSF). Prominent chemokines activated were IL-8, mac-
rophage inflammatory protein (MIP)-1β, more so than MIP-1α,
and Groα or melanoma growth stimulatory factor. Migration
inhibitory factor (MIF) expressed in the uninduced state declined
in LPS-activated cells. Of the immediate early genes, the notice-
able ones were c-fos, fra-1, c-jun, NF-κB p50, and IκB, with c-rel
expression observed even in the uninduced state (Fig. 2B). These
expression patterns are consistent with reported patterns of 
activation of certain LPS- and PMA-induced genes (12). Dem- 
onstrated here is the unique ability of this system to allow parallel 
visualization of a large number of gene activities over a period of 
time. 

The SW1353 cell line is derived from malignant tumors of the
cartilage and behaves much like chondrocytes upon stim-
ulation with TNF and IL-1 in the expression of MMPs (9). In
addition to confirming our earlier observations with Northern 
blots on Strom-1, Col-1, and Col-3 expression (9), gelatinase 
(Gel) A, putative metalloproteinase (PUMP)-1, membrane-
type matrix metalloproteinase, tissue inhibitors of matrix
metalloproteinases or tissue inhibitor of metalloproteinase 1 
(TIMP-1), -2, and -3 were also expressed by these cells together 
with the human matrix metallo-elastase (HME; Fig. 3A). HME 
induction was estimated to be ≈50-fold and was greater than
any of the other MMPs examined (Fig. 3B). This result was 
unexpected because HME is reportedly expressed only by 
alveolar macrophage and placental cells (16). Expression of 
the cytokines and chemokines IL-6, IL-8, MIF, and MIP-1β
was also noted. A variety of other genes, including certain 
transcription factors, were also up-regulated (Fig. 3), but the 
overall time-dependent expression of genes in the SW1353 
cells was qualitatively distinct from the MM6 cells. 

Quantitation of differential gene expression (Figs. 2B and
3B) was achieved with the simultaneous hybridization of 
Cy3-labeled cDNA from untreated cells and Cy5-labeled 
cDNA from treated samples. The estimated increases in 
expression from these microarrays for a select number of genes 
including IL-1β, IL-8, MIP-1β, TNF, HME, Col-1, Col-3,
Strom-1, and Strom-2 were compared with data collected from 
dot blot analysis. Results (not shown) were in close agreement 
and confirmed our earlier observations on the use of the 
microarray method for the quantitation of gene expression (3). 

Expression Profiles in Primary Chondrocytes and Synovio- 
cytes of Human RA Tissue. Given the sensitivity and the 
specificity of this method, expression profiles of primary 
synoviocytes and chondrocytes from diseased tissue were 
examined. Without prior exposure to inducing agents, low level 
expression of c-jun, GCSF, IL-3, TNF-β, MIF, and RANTES
(regulated upon activation, normal T cell expressed and se- 
creted) was seen as well as expression of MMPs, GelA, 
Strom-1, Col-1, and the three TIMPs. In this case, Col-2 
hybridization was considered to be nonspecific because the 
second Col-2 target taken from the 3' end of the gene gave no 

A. Human synovial fibroblasts B. Human articular chondrocytes

Fig. 3. Time course for IL-1β and TNF-induced SW1353 cells
using the inflammation array (Fig. 1). (A) Pseudocolor representations
of fluorescent scans correspond to gene expression levels at each time
point. (B I-IV) Relative levels of selected genes at different time points
compared with time zero.

Fig. 4. Expression profiles for early passage primary synoviocytes and
chondrocytes isolated from RA tissue, cultured in the presence of 10%
fetal calf serum and activated with PMA and IL-1β, or TNF and IL-1β,
or TGF-β for 18 hr. The color bars provide a comparative calibration scale
between arrays and are derived from the Arabidopsis mRNA samples that
are introduced in equal amounts during probe preparation.

signal. Treatment with PMA and IL-1, more so than with TNF and
IL-1, produced a dramatic up-regulation in expression of
several genes in both of these primary cell types. These genes 
are as follows: the cytokine IL-6; the chemokines IL-8 and
Gro-1α; the MMPs Strom-1, Col-1, Col-3, and HME; and
the adhesion molecule, vascular cell adhesion molecule 1
(VCAM-1). The surprise again is HME expression in these 
primary cells, for reasons discussed above. From these results, 
the expression profiles of synoviocytes and the chondrocytes 
appear very similar; the differences are more quantitative than 
qualitative. Treatment of the primary chondrocytes with the 
anabolic growth factor TGF-β had an interesting profile in that
it produced a remarkable down-regulation of genes expressed 
in both the untreated and induced state (Fig. 4). 

Given the demonstrated effectiveness of this technology, a 
comparative analysis of two different inflammatory disease 
states was conducted with probes made from RA tissue and 
IBD samples. RA samples were from late stage rheumatoid 
synovial tissue, and IBD specimens were obtained from in- 
flamed lower intestinal mucosa of patients with Crohn disease. 
With both the 96-element known gene microarray and the 
1000-gene microarray of cDNAs selected from a peripheral 
human blood cell library (3), distinct differences in gene 
expression patterns were evident. On the 96-gene array, RA 
tissue samples from different affected individuals gave similar 
profiles (data not shown) as did different samples from the 
same individual (Fig. 5). These patterns were notably similar 
to those observed with primary synoviocytes and chondrocytes 
(Fig. 4). Included in the list of prominently up-regulated genes 
are IL-6, the MMPs Strom-1, Col-1, GelA, HME, and in

Fig. 5. Expression profiles of RA tissue (A) and IBD tissue (B).
mRNA from RA tissue samples obtained from the same individual was
isolated directly after excision (RA 21.5A) or maintained in culture 
without serum for 2 hr (RA 21.5B) or for 6 hr (RA 21.5C). Profiles 
from tissue samples of two other individuals (data not shown) were 
remarkably similar to the ones shown here; IBD-A and IBD-CI are 
from mRNA samples prepared directly after surgery from two sepa- 
rate individuals. For the IBD-CII probe, the tissue sample was cultured 
in medium without serum for 2 hr before mRNA preparation. 



certain samples PUMP, TIMPs, particularly TIMP-1 and 
TIMP-3, and the adhesion molecule VCAM. Discernible levels
of macrophage chemotactic protein 1 (MCP-1), MIF, and
RANTES were also noted. IBD samples were, in comparison,
rather subdued, although IL-1 converting enzyme (ICE),
TIMP-1, and MIF were notable in all three different IBD
samples examined here. In IBD-A, one of three individual
samples, ICE, VCAM, Groα, and MMP expression was more
pronounced than in the others. 

We also made use of a peripheral blood cDNA library (3) 
to identify genes expressed by lymphocytes infiltrating the 
inflamed tissues from the circulating blood. With the 1046- 
element array of randomly selected cDNAs from this library, 
probes made from RA and IBD samples showed hybridizations
to a large number of genes. Of these, many were common 
between the two disease tissues while others were differentially 
expressed (data not shown). A complete survey of these genes 
was beyond the scope of this study, but for this report we 
picked three genes that were up-regulated in the RA tissue 
relative to IBD. These cDNAs were sequenced and identified 
by comparison to the GenBank database. They are TIMP-1, 
apoferritin light chain, and manganese superoxide dismutase 
(MnSOD). Differential expression of MnSOD was only ob- 
served in samples of RA tissue explants maintained in growth 
medium without serum for anywhere between 2 and 16 hr. These
results also indicate that the expression profile of genes can be 
altered when explants are transferred to culture conditions. 

DISCUSSION 

The speed, ease, and feasibility of simultaneously monitoring 
differential expression of hundreds of genes with the cDNA 
microarray based system (1-3) is demonstrated here in the 
analysis of a complex disease such as RA. Many different cell
types in the RA tissue (macrophages, lymphocytes, plasma cells,
neutrophils, synoviocytes, chondrocytes, etc.) are known to con-
tribute to the development of the disease with the expression of
gene products known to be proinflammatory. They include the 
cytokines, chemokines, growth factors, MMPs, eicosanoids, and 
others (7, 11-14), and the design of the 96-element known gene 
microarray was based on this knowledge and depended on the 
availability of the genes. The technology was validated by con- 
firming earlier observations on the expression of TNF by the 
monocyte cell line MM6, and of Col-1 and Col-3 expression in the 
chondrosarcoma cells and articular chondrocytes (9, 12). In our 
time-dependent survey the chronological order of gene activities 
in and between gene families was compared and the results have 
provided unprecedented profiles of the cytokines (TNF, IL-1, 
IL-6, GCSF, and MIF), chemokines (MIP-1α, MIP-1β, IL-8, and
Gro-1), certain transcription factors, and the matrix metal- 
loproteinases (GelA, Strom-1, Col-1, Col-3, HME) in the mac- 
rophage cell line MM6 and in the SW1353 chondrosarcoma cells. 

Earlier reports of cytokine production in the diseased state had 
established a model in which TNF is a major participant in RA. 
Its expression reportedly preceded that of the other cytokines and 
effector molecules (4). Our results strongly support this model,
as demonstrated in the time course of the MM6 cells where TNF
induction preceded that of IL-1α and IL-1β followed by IL-6 and
GCSF. These expression profiles demonstrate the utility of the
microarrays in determining the hierarchy of signaling events.

In the SW1353 chondrosarcoma cells, all the known MMPs and 
TIMPs were examined simultaneously. HME expression was 
discovered, which previously had been observed in only the 
stromal cells and alveolar macrophages of smoker's lungs and in 
placental tissue. Its presence in cells of the RA tissue is mean- 
ingful because its activity can cause significant destruction of 
elastin and basement membrane components (16, 17). Expression 
profiles of synovial fibroblasts and articular chondrocytes were 
remarkably similar and not too different from the SW1353 cells, 
indicating that the fibroblast and the chondrocyte can play equally 
aggressive roles in joint erosion. Prominent genes expressed were
the MMPs, but chemokines and cytokines were also produced by
these cells. The effect of the anabolic growth factor TGF-β was
profoundly evident in demonstrating the down-regulation of these
catabolic activities. 

RA tissue samples undeniably reflected profiles similar to 
the cell types examined. Active genes observed were IL-3, IL-6, 
ICE, the MMPs including HME and TIMPs, chemokines IL-8, 
Groa, MIP, MIF, and RANTES, and the adhesion molecule 
VCAM. Of the growth factors, fibroblast growth factor β was
observed most frequently. In comparison, the expression 
patterns in the other inflammatory state (i.e., IBD) were not 
as marked as in the RA samples, at least as obtained from the
tissue samples selected for this study. 

As an alternative approach, the 1046 cDNA microarray of 
randomly selected genes from a lymphocyte library was used to 
identify genes expressed in RA tissue (3). Many genes on this 
array hybridized with probes made from both RA and IBD tissue 
samples. The results are not surprising because inflammatory 
tissue is abundantly supplied with cell types infiltrating from the 
circulating blood, made apparent also by the high levels of 
chemokine expression in RA tissue. Because of the magnitude of 
the effort required to identify all the hybridized genes, we have for 
this report chosen to describe only three differentially expressed 
genes mainly to verify this method of analysis. 

Of the large number of genes observed here, a fair number 
were already known as active participants in inflammatory dis- 
ease. These are TNF, IL-1, IL-6, IL-8, GCSF, RANTES, and 
VCAM. The novel participants not previously reported are 
HME, IL-3, ICE, and Groα. With our discovery of HME
expression in RA, this gene becomes a target for drug interven-
tion. ICE is a cysteine protease well known for its IL-1β process-
ing activity (18), and recognized for its role in apoptotic cell death
(19). Its expression in RA tissue is intriguing. IL-3 is recognized
for its growth-promoting activity in hematopoietic cell lineages, is 
a product of activated T cells (20), and its expression in synovio- 
cytes and chondrocytes of RA tissue is a novel observation. 

Like IL-8, Groα is a C-X-C subgroup chemokine and is a
potent neutrophil and basophil chemoattractant. It down- 
regulates the expression of types I and III interstitial collagens 
(21, 22) and is seen here produced by the MM6 cells, in primary 
synoviocytes, and in RA tissue. With the presence of RANTES, 
MCP, and MIP-1β, the C-C chemokines (23), migration and
infiltration of monocytes, particularly T cells, into the tissue are
also enhanced (5), aiding the trafficking and recruitment of
leukocytes into the RA tissue. Their activation, phagocytosis,
degranulation, and respiratory bursts could be responsible for 
the induction of MnSOD in RA. MnSOD is also induced by 
TNF and IL-1 and serves a protective function against oxida- 
tive damage. The induction of the ferritin light chain encoding 
gene in this tissue may be for reasons similar to those for 
MnSOD. Ferritin is the major intracellular iron storage protein 
and it is responsive to intracellular oxidative stress and reactive 
oxygen intermediates generated during inflammation (24, 25). 
The active expression of TIMP-1 in RA tissue, as detected by
the 1000-element array, is no surprise because our results have 
repeatedly shown TIMP-1 to be expressed in the constitutive 
and induced states of RA cells and tissues. 

The suitability of the cDNA microarray technology for 
profiling diseases and for identifying disease related genes is 
well documented here. This technology could provide new 



targets for drug development and disease therapies, and in 
doing so allow for improved treatment of chronic diseases that 
are challenging because of their complexity.

MEASUREMENT OF GENE EXPRESSION PROFILES
IN TOXICITY DETERMINATION

Field of the Invention

The invention relates generally to methods for detecting and monitoring 
phenotypic changes in in vitro and in vivo systems for assessing and/or determining 
the toxicity of chemical compounds, and more particularly, the invention relates to a 
method for detecting and monitoring changes in gene expression patterns in in vitro 
and in vivo systems for determining the toxicity of drug candidates.

BACKGROUND 

The ability to rapidly and conveniently assess the toxicity of new compounds
is extremely important. Thousands of new compounds are synthesized every year,
and many are introduced to the environment through the development of new
commercial products and processes, often with little knowledge of their short term
and long term health effects. In the development of new drugs, the cost of assessing
the safety and efficacy of candidate compounds is becoming astronomical: It is
estimated that the pharmaceutical industry spends an average of about 300 million
dollars to bring a new pharmaceutical compound to market, e.g. Biotechnology, 13:
226-228 (1995). A large fraction of these costs is due to the failure of candidate
compounds in the later stages of the developmental process. That is, as the
assessment of a candidate drug progresses from the identification of a compound as a
drug candidate (for example, through relatively inexpensive binding assays or in vitro
screening assays) to pharmacokinetic studies, to toxicity studies, to efficacy studies in
model systems, to preliminary clinical studies, and so on, the costs of the associated
tests and analyses increase tremendously. Consequently, it may cost several tens of
millions of dollars to determine that a once promising candidate compound possesses
a side effect or cross reactivity that renders it commercially infeasible to develop
further. A great challenge of pharmaceutical development is to remove from further
consideration as early as possible those compounds that are likely to fail in the later
stages of drug testing.

Drug development programs are clearly structured with this objective in mind;
however, rapidly escalating costs have created a need to develop even more stringent
and less expensive screens in the early stages to identify false leads as soon as
possible. Toxicity assessment is an area where such improvements may be made, for
both drug development and for assessing the environmental, health, and safety effects
of new compounds in general.

Typically the toxicity of a compound is determined by administering the
compound to one or more species of test animal under controlled conditions and by 
monitoring the effects on a wide range of parameters. The parameters include such 
things as blood chemistry, weight gain or loss, a variety of behavioral patterns, muscle 
tone, body temperature, respiration rate, lethality, and the like, which collectively 
provide a measure of the state of health of the test animal. The degree of deviation of 
such parameters from their normal ranges gives a measure of the toxicity of a 
compound. Such tests may be designed to assess the acute, prolonged, or chronic 
toxicity of a compound. In general, acute tests involve administration of the test 
chemical on one occasion. The period of observation of the test animals may be as 
short as a few hours, although it is usually at least 24 hours and in some cases it may 
be as long as a week or more. In general, prolonged tests involve administration of 
the test chemical on multiple occasions. The test chemical may be administered one 
or more times each day, irregularly as when it is incorporated in the diet, at specific 
times such as during pregnancy, or in some cases regularly but only at weekly 
intervals. Also, in the prolonged test the experiment is usually conducted for not less 
than 90 days in the rat or mouse or a year in the dog. In contrast to the acute and 
prolonged types of test, the chronic toxicity tests are those in which the test chemical 
is administered for a substantial portion of the lifetime of the test animal. In the case
of the mouse or rat, this is a period of 2 to 3 years. In the case of the dog, it is for 5 to 
7 years. 

Significant costs are incurred in establishing and maintaining large cohorts of 
test animals for such assays, especially the larger animals in chronic toxicity assays. 
Moreover, because of species specific effects, passing such toxicity tests does not 
ensure that a compound is free of toxic effects when used in humans. Such tests do,
however, provide a standardized set of information for judging the safety of new
compounds, and they provide a database for giving preliminary assessments of related 
compounds. An important area for improving toxicity determination would be the 
identification of new observables which are predictive of the outcome of the 
expensive and tedious animal assays. 

In other medical fields, there has been significant interest in applying recent 
advances in biotechnology, particularly in DNA sequencing, to the identification and
study of differentially expressed genes in healthy and diseased organisms, e.g. Adams 
et al, Science, 252: 1651-1656 (1991); Matsubara et al, Gene, 135: 265-274 (1993); 
Rosenberg et al, International patent application, PCT/US95/01863. The objectives 
of such applications include increasing our knowledge of disease processes, 
identifying genes that play important roles in the disease process, and providing 
diagnostic and therapeutic approaches that exploit the expressed genes or their
products. While such approaches are attractive, those based on exhaustive, or even
sampled, sequencing of expressed genes are still beset by the enormous effort
required: It is estimated that 30-35 thousand different genes are expressed in a typical
mammalian tissue in any given state, e.g. Ausubel et al, Editors, Current Protocols,
5.8.1-5.8.4 (John Wiley & Sons, New York, 1992). Determining the sequences of
even a small sample of that number of gene products is a major enterprise, requiring 
industrial-scale resources. Thus, the routine application of massive sequencing of 
expressed genes is still beyond current commercial technology. 

The availability of new assays for assessing the toxicity of compounds, such 
as candidate drugs, that would provide more comprehensive and precise information
about the state of health of a test animal would be highly desirable. Such additional 
assays would preferably be less expensive, more rapid, and more convenient than 
current testing procedures, and would at the same time provide enough information to 
make early judgments regarding the safety of new compounds. 

Summary of the Invention 
An object of the invention is to provide a new approach to toxicity assessment 
based on an examination of gene expression patterns, or profiles, in in vitro or in vivo 
test systems.

Another object of the invention is to provide a database on which to base
decisions concerning the toxicological properties of chemicals, particularly drug
candidates.

A further object of the invention is to provide a method for analyzing gene
expression patterns in selected tissues of test animals.

A still further object of the invention is to provide a system for identifying
genes which are differentially expressed in response to exposure to a test compound.

Another object of the invention is to provide a rapid and reliable method for
correlating gene expression with short term and long term toxicity in test animals.

Another object of the invention is to identify genes whose expression is
predictive of deleterious toxicity.

The invention achieves these and other objects by providing a method for 
massively parallel signature sequencing of genes expressed in one or more selected 
tissues of an organism exposed to a test compound. An important feature of the 
invention is the application of novel DNA sorting and sequencing methodologies that 
permit the formation of gene expression profiles for selected tissues by determining
the sequence of portions of many thousands of different polynucleotides in parallel.
Such profiles may be compared with those from tissues of control organisms at single
or multiple time points to identify expression patterns predictive of toxicity.

The sorting methodology of the invention makes use of oligonucleotide tags
that are members of a minimally cross-hybridizing set of oligonucleotides. The 
sequences of oligonucleotides of such a set differ from the sequences of every other 
member of the same set by at least two nucleotides. Thus, each member of such a set 
cannot form a duplex (or triplex) with the complement of any other member with less 
than two mismatches. Complements of oligonucleotide tags of the invention, referred 
to herein as "tag complements," may comprise natural nucleotides or non-natural 
nucleotide analogs. Preferably, tag complements are attached to solid phase supports. 
Such oligonucleotide tags when used with their corresponding tag complements 
provide a means of enhancing specificity of hybridization for sorting polynucleotides, 
such as cDNAs. 
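
The defining property of such a set can be illustrated with a minimal sketch. It simplifies the definition above to a position-by-position mismatch count between equal-length tags and ignores shifted alignments and duplex chemistry. The function names and the greedy construction are assumptions made for illustration and are not the construction method of the invention.

```python
from itertools import product


def mismatches(a, b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def is_minimally_cross_hybridizing(tags, min_mismatches=2):
    """True if every pair of tags differs in at least min_mismatches positions."""
    tags = list(tags)
    if len({len(t) for t in tags}) > 1:
        raise ValueError("all tags must have the same length")
    return all(
        mismatches(a, b) >= min_mismatches
        for i, a in enumerate(tags)
        for b in tags[i + 1:]
    )


def greedy_tag_set(length, alphabet="ACGT", min_mismatches=2):
    """Build a set greedily: keep each candidate that is far enough from all kept tags."""
    chosen = []
    for letters in product(alphabet, repeat=length):
        candidate = "".join(letters)
        if all(mismatches(candidate, t) >= min_mismatches for t in chosen):
            chosen.append(candidate)
    return chosen


tags = greedy_tag_set(3)
```

Any two tags in the resulting list differ in at least two positions, so no tag can pair with the complement of another member with fewer than two mismatches under this simplified model.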

The polynucleotides to be sorted each have an oligonucleotide tag attached, 
such that different polynucleotides have different tags. As explained more fully 
below, this condition is achieved by employing a repertoire of tags substantially 
greater than the population of polynucleotides and by taking a sufficiently small 
sample of tagged polynucleotides from the full ensemble of tagged polynucleotides. 
After such sampling, when the populations of supports and polynucleotides are mixed 
under conditions which permit specific hybridization of the oligonucleotide tags with 
their respective complements, identical polynucleotides sort onto particular beads or 
regions. The sorted populations of polynucleotides can then be sequenced on the 
solid phase support by a "single-base" or "base-by-base" sequencing methodology, as 
described more fully below. 

In one aspect, the method of the invention comprises the following steps: (a) 
administering the compound to a test organism; (b) extracting a population of mRNA 
molecules from each of one or more tissues of the test organism; (c) forming a 
separate population of cDNA molecules from each population of mRNA molecules 
extracted from the one or more tissues such that each cDNA molecule of the separate 
populations has an oligonucleotide tag attached, the oligonucleotide tags being 
selected from the same minimally cross-hybridizing set; (d) separately sampling each 
population of cDNA molecules such that substantially all different cDNA molecules 
within a separate population have different oligonucleotide tags attached; (e) sorting 
the cDNA molecules of each separate population by specifically hybridizing the 
oligonucleotide tags with their respective complements, the respective complements 
being attached as uniform populations of substantially identical complements in 
spatially discrete regions on one or more solid phase supports; (f) determining the 
nucleotide sequence of a portion of each of the sorted cDNA molecules of each 
separate population to form a frequency distribution of expressed genes for each of 
the one or more tissues; and (g) correlating the frequency distribution of expressed 
genes in each of the one or more tissues with the toxicity of the compound. 

An important aspect of the invention is the identification of genes whose 
expression is predictive of the toxicity of a compound. Once such genes are 
identified, they may be employed in conventional assays, such as reverse transcriptase 
polymerase chain reaction (RT-PCR) assays for gene expression. 

Brief Description of the Drawings 
Figure 1 is a flow chart representation of an algorithm for generating 
minimally cross-hybridizing sets of oligonucleotides. 

Figure 2 diagrammatically illustrates an apparatus for carrying out 
polynucleotide sequencing in accordance with the invention. 

Definitions 

"Complement" or "tag complement" as used herein in reference to 
oligonucleotide tags refers to an oligonucleotide to which an oligonucleotide tag 
specifically hybridizes to form a perfectly matched duplex or triplex. In embodiments 
where specific hybridization results in a triplex, the oligonucleotide tag may be 
selected to be either double stranded or single stranded. Thus, where triplexes are 
formed, the term "complement" is meant to encompass either a double stranded 
complement of a single stranded oligonucleotide tag or a single stranded complement 
of a double stranded oligonucleotide tag. 

The term "oligonucleotide" as used herein includes linear oligomers of natural 
or modified monomers or linkages, including deoxyribonucleosides, ribonucleosides, 
anomeric forms thereof, peptide nucleic acids (PNAs), and the like, capable of 
specifically binding to a target polynucleotide by way of a regular pattern of 
monomer-to-monomer interactions, such as Watson-Crick type of base pairing, base 
stacking, Hoogsteen or reverse Hoogsteen types of base pairing, or the like. Usually 
monomers are linked by phosphodiester bonds or analogs thereof to form 
oligonucleotides ranging in size from a few monomeric units, e.g. 3-4, to several tens 
of monomeric units. Whenever an oligonucleotide is represented by a sequence of 
letters, such as "ATGCCTG," it will be understood that the nucleotides are in 5'->3' 
order from left to right and that "A" denotes deoxyadenosine, "C" denotes 
deoxycytidine, "G" denotes deoxyguanosine, and "T" denotes thymidine, unless 
otherwise noted. Analogs of phosphodiester linkages include phosphorothioate, 
phosphorodithioate, phosphoranilidate, phosphoramidate, and the like. Usually 
oligonucleotides of the invention comprise the four natural nucleotides; however, they 
may also comprise non-natural nucleotide analogs. It is clear to those skilled in the 
art when oligonucleotides having natural or non-natural nucleotides may be 
employed, e.g. where processing by enzymes is called for, usually oligonucleotides 
consisting of natural nucleotides are required. 

"Perfectly matched" in reference to a duplex means that the poly- or 
oligonucleotide strands making up the duplex form a double stranded structure with 
one another such that every nucleotide in each strand undergoes Watson-Crick 
basepairing with a nucleotide in the other strand. The term also comprehends the 
pairing of nucleoside analogs, such as deoxyinosine, nucleosides with 2-aminopurine 
bases, and the like, that may be employed. In reference to a triplex, the term means 
that the triplex consists of a perfectly matched duplex and a third strand in which 
every nucleotide undergoes Hoogsteen or reverse Hoogsteen association with a 
basepair of the perfectly matched duplex. Conversely, a "mismatch" in a duplex 
between a tag and an oligonucleotide means that a pair or triplet of nucleotides in the 
duplex or triplex fails to undergo Watson-Crick and/or Hoogsteen and/or reverse 
Hoogsteen bonding. 
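
As a rough illustration of the duplex case only, the following Python sketch counts mismatches between a tag and a candidate complement. The function names are hypothetical, only the four natural DNA bases are handled, and triplex (Hoogsteen) association and nucleoside analogs are deliberately left out.

```python
# Watson-Crick partners for the four natural DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the perfectly matched complement of seq, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def count_mismatches(tag: str, candidate: str) -> int:
    """Count positions at which candidate fails to pair with tag.

    Both strands are written 5'->3' and are assumed to align end to end
    in an antiparallel duplex (no bulges or overhangs are modelled).
    """
    if len(tag) != len(candidate):
        raise ValueError("strands must have equal length")
    perfect = reverse_complement(tag)
    return sum(1 for a, b in zip(perfect, candidate) if a != b)


# A duplex is "perfectly matched" in this sketch when the count is zero.
print(count_mismatches("ATGCCTG", "CAGGCAT"))
```

A count of zero corresponds to a perfectly matched duplex in the sense defined above; any positive count is the number of mismatched base pairs.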

As used herein, "nucleoside" includes the natural nucleosides, including 2'-deoxy 
and 2'-hydroxyl forms, e.g. as described in Kornberg and Baker, DNA 
Replication, 2nd Ed. (Freeman, San Francisco, 1992). "Analogs" in reference to 
nucleosides includes synthetic nucleosides having modified base moieties and/or 
modified sugar moieties, e.g. described by Scheit, Nucleotide Analogs (John Wiley, 
New York, 1980); Uhlmann and Peyman, Chemical Reviews, 90: 543-584 (1990); or 
the like, with the only proviso that they are capable of specific hybridization. Such 
analogs include synthetic nucleosides designed to enhance binding properties, reduce 
complexity, increase specificity, and the like. 

As used herein "sequence determination" or "determining a nucleotide 
sequence" in reference to polynucleotides includes determination of partial as well as 
full sequence information of the polynucleotide. That is, the term includes sequence 
comparisons, fingerprinting, and like levels of information about a target 
polynucleotide, as well as the express identification and ordering of nucleosides, 
usually each nucleoside, in a target polynucleotide. The term also includes the 
determination of the identification, ordering, and locations of one, two, or three of the 
four types of nucleotides within a target polynucleotide. For example, in some 
embodiments sequence determination may be effected by identifying the ordering and 
locations of a single type of nucleotide, e.g. cytosines, within the target polynucleotide 
"CATCGC ..." so that its sequence is represented as a binary code, e.g. "100101 ... " for 
"C-(not C)-(not C)-C-(not C)-C ... " and the like. 

As used herein, the term "complexity" in reference to a population of 
polynucleotides means the number of different species of molecule present in the 
population. 
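
The two preceding definitions lend themselves to a short illustration. The Python sketch below, whose function names are hypothetical, encodes a sequence as the binary code described for single-nucleotide sequence determination and computes the complexity of a population as the number of distinct species it contains.

```python
def binary_signature(sequence: str, base: str = "C") -> str:
    """Encode a sequence as 1 where `base` occurs and 0 elsewhere."""
    target = base.upper()
    return "".join("1" if nt == target else "0" for nt in sequence.upper())


def complexity(population) -> int:
    """Number of different species present in a population of sequences."""
    return len(set(population))


# The example from the text: cytosines within "CATCGC".
print(binary_signature("CATCGC"))  # 100101
```

The population passed to `complexity` is assumed to be a list of sequence strings in which identical molecules appear as identical strings.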

As used herein, the terms "gene expression profile" and "gene expression 
pattern," which are used equivalently, mean a frequency distribution of sequences of 
portions of cDNA molecules sampled from a population of tag-cDNA conjugates. 
Generally, the portions of sequence are sufficiently long to uniquely identify the 
cDNA from which the portion arose. Preferably, the total number of sequences 
determined is at least 1000; more preferably, the total number of sequences 
determined in a gene expression profile is at least ten thousand. 
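
A gene expression profile in this sense can be sketched as a simple frequency table. The following Python fragment is illustrative only: the function name is hypothetical and the signature strings in the usage line are made-up placeholders, not data from the specification.

```python
from collections import Counter


def gene_expression_profile(signatures):
    """Return the relative frequency of each signature sequence.

    `signatures` is an iterable of sequenced cDNA portions, one entry
    per sorted cDNA molecule that was read.
    """
    counts = Counter(signatures)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no signatures supplied")
    return {signature: n / total for signature, n in counts.items()}


# Placeholder reads; a real profile would preferably hold at least 1000.
profile = gene_expression_profile(
    ["GATTACAGA", "GATTACAGA", "TTGACCATG", "GATTACAGA"]
)
print(profile)
```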

As used herein, "test organism" means any in vitro or in vivo system which 
provides measurable responses to exposure to test compounds. Typically, test 
organisms may be mammalian cell cultures, particularly of specific tissues, such as 
hepatocytes, neurons, kidney cells, colony forming cells, or the like, or test organisms 
may be whole animals, such as rats, mice, hamsters, guinea pigs, dogs, cats, rabbits, 
pigs, monkeys, and the like. 

Detailed Description of the Invention 
The invention provides a method for determining the toxicity of a compound 
by analyzing changes in the gene expression profiles in selected tissues of test 
organisms exposed to the compound. The invention also provides a method of 
identifying toxicity markers consisting of individual genes or a group of genes that is 
expressed acutely and which is correlated with prolonged or chronic toxicity, or 
suggests that the compound will have an undesirable cross reactivity. Gene 
expression profiles are generated by sequencing portions of cDNA molecules 
constructed from mRNA extracted from tissues of test organisms exposed to the 
compound being tested. As used herein, the term "tissue" is employed with its usual 
medical or biological meaning, except that in reference to an in vitro test system, such 
as a cell culture, it simply means a sample from the culture. Gene expression profiles 
derived from test organisms are compared to gene expression profiles derived from 
control organisms to determine the genes which are differentially expressed in the test 
organism because of exposure to the compound being tested. In both cases, the 
sequence information of the gene expression profiles is obtained by massively parallel 
signature sequencing of cDNAs, which is implemented in steps (c) through (f) of the 
above method. 
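
One simple way to compare a test profile with a control profile is a per-gene log ratio of relative frequencies. The Python sketch below is an assumption made for illustration, not a procedure stated in the specification: the function name, the use of a log2 ratio, and the pseudocount that avoids division by zero for genes seen in only one profile are all hypothetical choices.

```python
import math


def log2_fold_changes(test_counts, control_counts, pseudocount=1.0):
    """Log2 ratio of test to control relative frequency for every gene.

    The pseudocount is an illustrative assumption; it keeps genes that
    are absent from one profile from producing a zero denominator.
    """
    genes = set(test_counts) | set(control_counts)
    test_total = sum(test_counts.values()) + pseudocount * len(genes)
    control_total = sum(control_counts.values()) + pseudocount * len(genes)
    changes = {}
    for gene in genes:
        test_freq = (test_counts.get(gene, 0) + pseudocount) / test_total
        control_freq = (control_counts.get(gene, 0) + pseudocount) / control_total
        changes[gene] = math.log2(test_freq / control_freq)
    return changes


# Placeholder signature counts for two hypothetical genes.
changes = log2_fold_changes({"geneA": 30, "geneB": 10}, {"geneA": 10, "geneB": 30})
print(changes)
```

Positive values indicate signatures that are more frequent in the exposed tissue, negative values those that are less frequent; any threshold for calling a gene differentially expressed would have to be chosen separately.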

Toxicity Assessment 
Procedures for designing and conducting toxicity tests in in vitro and in vivo 
systems are well known, and are described in many texts on the subject, such as Loomis 
et al, Loomis's Essentials of Toxicology, 4th Ed. (Academic Press, New York, 1996); 
Ecobichon, The Basics of Toxicity Testing (CRC Press, Boca Raton, 1992); Frazier, 
editor, In Vitro Toxicity Testing (Marcel Dekker, New York, 1992); and the like. 

In toxicity testing, two groups of test organisms are usually employed: one 
group serves as a control and the other group receives the test compound in a single 
dose (for acute toxicity tests) or a regimen of doses (for prolonged or chronic toxicity 
tests). Since in most cases, the extraction of tissue as called for in the method of the 
invention requires sacrificing the test animal, both the control group and the group 
receiving compound must be large enough to permit removal of animals for sampling 
tissues, if it is desired to observe the dynamics of gene expression through the 
duration of an experiment. 

In setting up a toxicity study, extensive guidance is provided in the literature 
for selecting the appropriate test organism for the compound being tested, route of 
administration, dose ranges, and the like. Water or physiological saline (0.9% NaCl 
in water) is the solvent of choice for the test compound since these solvents permit 
administration by a variety of routes. When this is not possible because of solubility 
limitations, it is necessary to resort to the use of vegetable oils such as corn oil or 
even organic solvents, of which propylene glycol is commonly used. Whenever 
possible the use of suspensions or emulsions should be avoided except for oral 
administration. Regardless of the route of administration, the volume required to 
administer a given dose is limited by the size of the animal that is used. It is desirable 
to keep the volume of each dose uniform within and between groups of animals. 
When rats or mice are used the volume administered by the oral route should not 
exceed 0.005 ml per gram of animal. Even when aqueous or physiological saline 
solutions are used for parenteral injection the volumes that are tolerated are limited, 
although such solutions are ordinarily thought of as being innocuous. The 
intravenous LD50 of distilled water in the mouse is approximately 0.044 ml per gram 
and that of isotonic saline is 0.068 ml per gram of mouse. 
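
The oral volume limit quoted above reduces to a one-line calculation. The Python sketch below uses only the figure of 0.005 ml per gram given in the text; the function and constant names are hypothetical.

```python
# Upper limit for oral administration to rats or mice, from the text.
MAX_ORAL_ML_PER_GRAM = 0.005


def max_oral_volume_ml(body_weight_g: float) -> float:
    """Largest oral dose volume, in ml, for an animal of the given weight."""
    if body_weight_g <= 0:
        raise ValueError("body weight must be positive")
    return MAX_ORAL_ML_PER_GRAM * body_weight_g


# Example weights are illustrative: a 25 g mouse and a 200 g rat.
print(max_oral_volume_ml(25.0), max_oral_volume_ml(200.0))
```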

When a compound is to be administered by inhalation, special techniques for 
generating test atmospheres are necessary. Dose estimation becomes very 
complicated. The methods usually involve aerosolization or nebulization of fluids 
containing the compound. If the agent to be tested is a fluid that has an appreciable 
vapor pressure, it may be administered by passing air through the solution under 
controlled temperature conditions. Under these conditions, dose is estimated from the 
volume of air inhaled per unit time, the temperature of the solution, and the vapor 
pressure of the agent involved. Gases are metered from reservoirs. When particles of 
a solution are to be administered, unless the particle size is less than about 2 µm the 
particles will not reach the terminal alveolar sacs in the lungs. A variety of 
apparatuses and chambers are available to perform studies for detecting effects of 
irritant or other toxic endpoints when they are administered by inhalation. The 
preferred method of administering an agent to animals is via the oral route, either by 
intubation or by incorporating the agent in the feed. 

Preferably, in designing a toxicity assessment, two or more species should be 
employed that handle the test compound as similarly to man as possible in terms of 
metabolism, absorption, excretion, tissue storage, and the like. Preferably, multiple 
doses or regimens at different concentrations should be employed to establish a 
dose-response relationship with respect to toxic effects. And preferably, the route of 
administration to the test animal should be the same as, or as similar as possible to, 
the route of administration of the compound to man. Effects obtained by one route of 
administration to test animals are not a priori applicable to effects by another route of 
administration to man. For example, food additives for man should be tested by 
admixture of the material in the diet of the test animals. 

Acute toxicity tests consist of administering a compound to test organisms on 
one occasion. The purpose of such a test is to determine the symptomatology 
consequent to administration of the compound and to determine the degree of lethality 
of the compound. The initial procedure is to perform a series of range-finding doses 
of the compound in a single species. This necessitates selection of a route of 
administration, preparation of the compound in a form suitable for administration by 
the selected route, and selection of an appropriate species. Preferably, initial acute 
toxicity studies are performed on either rats or mice because of their low cost, their 
availability, and the availability of abundant toxicologic reference data on these 
species. Prolonged toxicity tests consist of administering a compound to test 
organisms repeatedly, usually on a daily basis, over a period of 3 to 4 months. Two 
practical factors are encountered that place constraints on the design of such tests: 
First, the available routes of administration are limited because the route selected 
must be suitable for repeated administration without inducing harmful effects. And 
second, blood, urine, and perhaps other samples should be taken repeatedly without 
inducing significant harm to the test animals. Preferably, in the method of the 
invention the gene expression profiles are obtained in conjunction with the 
measurement of the traditional toxicologic parameters, such as listed in the table 
below: 

Hematology                     Blood Chemistry                          Urine Analyses

erythrocyte count              sodium                                   pH
total leukocyte count          potassium                                specific gravity
differential leukocyte count   chloride                                 total protein
hematocrit                     calcium                                  sediment
hemoglobin                     carbon dioxide                           glucose
                               serum glutamic-pyruvic transaminase      ketones
                               serum glutamic-oxalacetic transaminase   bilirubin
                               serum protein electrophoresis
                               blood sugar
                               blood urea nitrogen
                               total serum protein
                               serum albumin
                               total serum bilirubin



Oligonucleotide Tags and Tag Complements 
Oligonucleotide tags are members of a minimally cross-hybridizing set of 
oligonucleotides. The sequences of oligonucleotides of such a set differ from the 
sequences of every other member of the same set by at least two nucleotides. Thus, 
each member of such a set cannot form a duplex (or triplex) with the complement of 
any other member with less than two mismatches. Complements of oligonucleotide 
tags, referred to herein as "tag complements," may comprise natural nucleotides or 
non-natural nucleotide analogs. Preferably, tag complements are attached to solid 
phase supports. Such oligonucleotide tags when used with their corresponding tag 
complements provide a means of enhancing specificity of hybridization for sorting, 
tracking, or labeling molecules, especially polynucleotides. 

Minimally cross-hybridizing sets of oligonucleotide tags and tag complements 
may be synthesized either combinatorially or individually depending on the size of the 
set desired and the degree to which cross-hybridization is sought to be minimized (or 
stated another way, the degree to which specificity is sought to be enhanced). For 
example, a minimally cross-hybridizing set may consist of a set of individually 
synthesized 10-mer sequences that differ from each other by at least 4 nucleotides, 
such set having a maximum size of 332 (when composed of 3 kinds of nucleotides 
and counted using a computer program such as disclosed in Appendix Ic). 
Alternatively, a minimally cross-hybridizing set of oligonucleotide tags may also be 
assembled combinatorially from subunits which themselves are selected from a 
minimally cross-hybridizing set. For example, a set of minimally cross-hybridizing 
12-mers differing from one another by at least three nucleotides may be synthesized 
by assembling 3 subunits selected from a set of minimally cross-hybridizing 4-mers 
that each differ from one another by three nucleotides. Such an embodiment gives a 
maximally sized set of 9^3, or 729, 12-mers. The number 9 is the number of 
oligonucleotides listed by the computer program of Appendix Ia, which assumes, as 
with the 10-mers, that only 3 of the 4 different types of nucleotides are used. The set 
is described as "maximal" because the computer programs of Appendices Ia-c provide 
the largest set for a given input (e.g. length, composition, difference in number of 
nucleotides between members). Additional minimally cross-hybridizing sets may be 
formed from subsets of such calculated sets. 
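
The combinatorial assembly described above can be sketched in a few lines of Python. The nine 4-mers used here are transcribed from Set 2 of Table III purely as an example of a set built from three kinds of nucleotides; the function name is hypothetical, and any other minimally cross-hybridizing set of subunits could be substituted.

```python
from itertools import product

# Nine 4-mers transcribed from Set 2 of Table III (A, C and G only).
SUBUNITS = ["ACCC", "AGGG", "CACG", "CCGA", "CGAC", "GAGC", "GCAG", "GGCA", "AAAA"]


def assemble_tags(subunits, subunits_per_tag):
    """Concatenate every ordered choice of subunits into one tag each."""
    return ["".join(words) for words in product(subunits, repeat=subunits_per_tag)]


tags = assemble_tags(SUBUNITS, 3)
print(len(tags))  # 729, i.e. 9**3 twelve-mers
```

Because two distinct tags must differ in at least one subunit position, and any two distinct subunits of the chosen set differ by three nucleotides, every pair of assembled 12-mers differs by at least three nucleotides.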

Oligonucleotide tags may be single stranded and be designed for specific 
hybridization to single stranded tag complements by duplex formation or for specific 
hybridization to double stranded tag complements by triplex formation. 
Oligonucleotide tags may also be double stranded and be designed for specific 
hybridization to single stranded tag complements by triplex formation. 

When synthesized combinatorially, an oligonucleotide tag preferably consists 
of a plurality of subunits, each subunit consisting of an oligonucleotide of 3 to 9 
nucleotides in length wherein each subunit is selected from the same minimally 
cross-hybridizing set. In such embodiments, the number of oligonucleotide tags available 
depends on the number of subunits per tag and on the length of the subunits. The 
number is generally much less than the number of all possible sequences the length of 
the tag, which for a tag n nucleotides long would be 4^n. 

Complements of oligonucleotide tags attached to a solid phase support are 
used to sort polynucleotides from a mixture of polynucleotides each containing a tag. 
Complements of the oligonucleotide tags are synthesized on the surface of a solid 
phase support, such as a microscopic bead or a specific location on an array of 
synthesis locations on a single support, such that populations of identical sequences 
are produced in specific regions. That is, the surface of each support, in the case of a 
bead, or of each region, in the case of an array, is derivatized by only one type of 
complement which has a particular sequence. The population of such beads or regions 
contains a repertoire of complements with distinct sequences. As used herein in 
reference to oligonucleotide tags and tag complements, the term "repertoire" means 
the set of minimally cross-hybridizing oligonucleotides that make up the tags in 
a particular embodiment or the corresponding set of tag complements. 

The polynucleotides to be sorted each have an oligonucleotide tag attached, 
such that different polynucleotides have different tags. As explained more fully 
below, this condition is achieved by employing a repertoire of tags substantially 
greater than the population of polynucleotides and by taking a sufficiently small 
sample of tagged polynucleotides from the full ensemble of tagged polynucleotides. 
After such sampling, when the populations of supports and polynucleotides are mixed 
under conditions which permit specific hybridization of the oligonucleotide tags with 
their respective complements, identical polynucleotides sort onto particular beads or 
regions. 
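
The effect of taking a small sample from a large tag repertoire can be estimated with a short calculation. The Python sketch below rests on an assumption not stated in this passage, namely that each sampled conjugate carries a tag drawn independently and uniformly from the repertoire; the function name and the example figures are hypothetical.

```python
def expected_unique_fraction(repertoire_size: int, sample_size: int) -> float:
    """Expected fraction of sampled conjugates whose tag is not shared.

    Under the uniform, independent assumption, a given conjugate's tag is
    missed by each of the other (sample_size - 1) conjugates with
    probability (1 - 1/repertoire_size).
    """
    if repertoire_size < 1 or sample_size < 1:
        raise ValueError("sizes must be at least 1")
    return (1.0 - 1.0 / repertoire_size) ** (sample_size - 1)


# A sample of one percent of the repertoire size keeps about 99 percent
# of the sampled conjugates on tags that no other sampled conjugate carries.
print(expected_unique_fraction(10**6, 10**4))
```

The smaller the sample relative to the repertoire, the closer this fraction is to one, which is the condition the passage describes.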

The nucleotide sequences of oligonucleotides of a minimally cross-hybridizing 
set are conveniently enumerated by simple computer programs, such as those 
exemplified by programs whose source codes are listed in Appendices Ia and Ib. 
Program minhx of Appendix Ia computes all minimally cross-hybridizing sets having 
4-mer subunits composed of three kinds of nucleotides. Program tagN of Appendix 
Ib enumerates longer oligonucleotides of a minimally cross-hybridizing set. Similar 
algorithms and computer programs are readily written for listing oligonucleotides of 
minimally cross-hybridizing sets for any embodiment of the invention. Table I below 
provides guidance as to the size of sets of minimally cross-hybridizing 
oligonucleotides for the indicated lengths and number of nucleotide differences. The 
above computer programs were used to generate the numbers. 

Table I

Oligonucleotide   Nucleotide Difference      Maximal Size of    Size of           Size of
Word Length       between Oligonucleotides   Minimally Cross-   Repertoire with   Repertoire with
                  of Minimally Cross-        Hybridizing Set    Four Words        Five Words
                  Hybridizing Set

4                 3                          9                  6561              5.90 x 10^4
6                 3                          27                 5.3 x 10^5        1.43 x 10^7
7                 4                          27                 5.3 x 10^5        1.43 x 10^7
7                 5                          8                  4096              3.28 x 10^4
8                 3                          190                1.30 x 10^9       2.48 x 10^11
8                 4                          62                 1.48 x 10^7       9.16 x 10^8
8                 5                          18                 1.05 x 10^5       1.89 x 10^6
9                 5                          39                 2.31 x 10^6       9.02 x 10^7
10                5                          332                1.21 x 10^10
10                6                          28                 6.15 x 10^5       1.72 x 10^7
11                5                          187
18                6                          *25000
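
The repertoire columns of Table I follow from the set size raised to the number of words per tag. The short Python check below reproduces that arithmetic; the function name is hypothetical and the set sizes are those listed in the table.

```python
def repertoire_size(set_size: int, words_per_tag: int) -> int:
    """Number of distinct tags built from `words_per_tag` words."""
    return set_size ** words_per_tag


# Maximal set sizes taken from Table I.
for set_size in (9, 27, 8, 190, 62, 18, 39, 332, 28):
    four = repertoire_size(set_size, 4)
    five = repertoire_size(set_size, 5)
    print(f"{set_size:>4}  {four:.2e}  {five:.2e}")
```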

For some embodiments of the invention, where extremely large repertoires of 
tags are not required, oligonucleotide tags of a minimally cross-hybridizing set may 
be separately synthesized. Sets containing several hundred to several thousands, or 
even several tens of thousands, of oligonucleotides may be synthesized directly by a 
variety of parallel synthesis approaches, e.g. as disclosed in Frank et al, U.S. patent 
4,689,405; Frank et al, Nucleic Acids Research, 11: 4365-4377 (1983); Matson et al, 
Anal. Biochem., 224: 110-116 (1995); Fodor et al, International application 
PCT/US93/04145; Pease et al, Proc. Natl. Acad. Sci., 91: 5022-5026 (1994); 
Southern et al, J. Biotechnology, 35: 217-227 (1994); Brennan, International 
application PCT/US94/05896; Lashkari et al, Proc. Natl. Acad. Sci., 92: 7912-7915 
(1995); or the like. 

Preferably, oligonucleotide tags of the invention are synthesized 
combinatorially out of subunits between three and six nucleotides in length and 
selected from the same minimally cross-hybridizing set. For oligonucleotides in this 
range, the members of such sets may be enumerated by computer programs based on 
the algorithm of Fig. 1. 

The algorithm of Fig. 1 is implemented by first defining the characteristics of 
the subunits of the minimally cross-hybridizing set, i.e. length, number of base 
differences between members, and composition, e.g. do they consist of two, three, or 
four kinds of bases. A table M_n, n=1, is generated (100) that consists of all possible 
sequences of a given length and composition. An initial subunit S_1 is selected and 
compared (120) with successive subunits S_i for i=n+1 to the end of the table. 
Whenever a successive subunit has the required number of mismatches to be a 
member of the minimally cross-hybridizing set, it is saved in a new table M_{n+1} (125), 
that also contains subunits previously selected in prior passes through step 120. For 
example, in the first set of comparisons, M_2 will contain S_1; in the second set of 
comparisons, M_3 will contain S_1 and S_2; in the third set of comparisons, M_4 will 
contain S_1, S_2, and S_3; and so on. Similarly, comparisons in table M_j will be 
between S_j and all successive subunits in M_j. Note that each successive table M_{n+1} 
is smaller than its predecessors as subunits are eliminated in successive passes 
through step 130. After every subunit of table M_n has been compared (140) the old 
table is replaced by the new table M_{n+1}, and the next round of comparisons is 
begun. The process stops (160) when a table M_n is reached that contains no 
successive subunits to compare to the selected subunit S_i, i.e. M_n = M_{n+1}. 
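
The table-sieving procedure described above can be sketched as follows. This Python fragment is a reading of the prose description, not the program of the appendices: the function names are hypothetical, candidate subunits are visited in lexicographic order, and the result is a set to which no further subunit can be added, which is not necessarily the largest possible set for a given input.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def minimally_cross_hybridizing_set(length, alphabet, min_mismatches):
    """Sieve all sequences down to a set with the required spacing."""
    # Table M_1: every sequence of the given length and composition.
    table = ["".join(p) for p in product(alphabet, repeat=length)]
    n = 0
    while n < len(table):
        selected = table[n]
        # New table: subunits selected so far, plus every later subunit
        # that has enough mismatches with the currently selected one.
        table = table[: n + 1] + [
            s for s in table[n + 1 :] if hamming(selected, s) >= min_mismatches
        ]
        n += 1
    return table


subunits = minimally_cross_hybridizing_set(4, "ACG", 3)
print(subunits)
```

The loop ends when no successive subunits remain to be compared, which corresponds to the stopping condition in which the new table equals the old one.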

Preferably, minimally cross-hybridizing sets comprise subunits that make 
approximately equivalent contributions to duplex stability as every other subunit in 
the set. In this way, the stability of perfectly matched duplexes between every subunit 
and its complement is approximately equal. Guidance for selecting such sets is 
provided by published techniques for selecting optimal PCR primers and calculating 
duplex stabilities, e.g. Rychlik et al, Nucleic Acids Research, 17: 8543-8551 (1989) 
and 18: 6409-6412 (1990); Breslauer et al, Proc. Natl. Acad. Sci., 83: 3746-3750 
(1986); Wetmur, Crit. Rev. Biochem. Mol. Biol., 26: 227-259 (1991); and the like. 
For shorter tags, e.g. about 30 nucleotides or less, the algorithm described by Rychlik 
and Wetmur is preferred, and for longer tags, e.g. about 30-35 nucleotides or greater, 
an algorithm disclosed by Suggs et al, pages 683-693 in Brown, editor, ICN-UCLA 
Symp. Dev. Biol., Vol. 23 (Academic Press, New York, 1981) may be conveniently 
employed. Clearly, there are many approaches available to one skilled in the art for 
designing sets of minimally cross-hybridizing subunits within the scope of the 
invention. For example, to minimize the effects of different base-stacking energies of 
terminal nucleotides when subunits are assembled, subunits may be provided that 
have the same terminal nucleotides. In this way, when subunits are linked, the sum of 
the base-stacking energies of all the adjoining terminal nucleotides will be the same, 
thereby reducing or eliminating variability in tag melting temperatures. 

A "word" of terminal nucleotides, shown in italic below, may also be added to 
each end of a tag so that a perfect match is always formed between it and a similar 
terminal "word" on any other tag complement. Such an augmented tag would have 
the form: 

    W    W_1    W_2    ...   W_{k-1}    W_k    W
    W'   W_1'   W_2'   ...   W_{k-1}'   W_k'   W'

where the primed W's indicate complements. With ends of tags always forming 
perfectly matched duplexes, all mismatched words will be internal mismatches 
thereby reducing the stability of tag-complement duplexes that otherwise would have 
mismatched words at their ends. It is well known that duplexes with internal 
mismatches are significantly less stable than duplexes with the same mismatch at a 
terminus. 

A preferred embodiment of minimally cross-hybridizing sets are those whose 
subunits are made up of three of the four natural nucleotides. As will be discussed 
more fully below, the absence of one type of nucleotide in the oligonucleotide tags 
permits target polynucleotides to be loaded onto solid phase supports by use of the 
5'->3' exonuclease activity of a DNA polymerase. The following is an exemplary 
minimally cross-hybridizing set of subunits each comprising four nucleotides selected 
from the group consisting of A, G, and T: 
Table II

Word:       w_1     w_2     w_3     w_4
Sequence:   GATT    TGAT    TAGA    TTTG

Word:       w_5     w_6     w_7     w_8
Sequence:   GTAA    AGTA    ATGT    AAAG


In this set, each member would form a duplex having three mismatched bases with 
the complement of every other member. 
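
The stated property of the Table II words can be checked directly. In the Python sketch below, whose names are hypothetical, the number of mismatches between one word and the complement of another is taken to equal the number of positions at which the two words themselves differ.

```python
from itertools import combinations

# The eight 4-mer words of Table II (A, G and T only).
TABLE_II_WORDS = ["GATT", "TGAT", "TAGA", "TTTG", "GTAA", "AGTA", "ATGT", "AAAG"]


def mismatches(a: str, b: str) -> int:
    """Positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


pairwise = [mismatches(a, b) for a, b in combinations(TABLE_II_WORDS, 2)]
print(min(pairwise), max(pairwise))  # 3 3
```

All 28 pairs differ at exactly three positions, consistent with the statement above.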

Further exemplary minimally cross-hybridizing sets are listed below in Table 
III. Clearly, additional sets can be generated by substituting different groups of 
nucleotides, or by using subsets of known minimally cross-hybridizing sets. 
Table III

Exemplary Minimally Cross-Hybridizing Sets of 4-mer Subunits

Set 1    Set 2    Set 3    Set 4    Set 5    Set 6

CATT     ACCC     AAAC     AAAG     AACA     AACG
CTAA     AGGG     ACCA     ACCA     ACAC     ACAA
TCAT     CACG     AGGG     AGGC     AGGG     AGGC
ACTA     CCGA     CACG     CACC     CAAG     CAAC
TACA     CGAC     CCGC     CCGG     CCGC     CCGG
TTTC     GAGC     CGAA     CGAA     CGCA     CGCA
ATCT     GCAG     GAGA     GAGA     GAGA     GAGA
AAAC     GGCA     GCAG     GCAC     GCCG     GCCC
         AAAA     GGCC     GGCG     GGAC     GGAG

Set 7    Set 8    Set 9    Set 10   Set 11   Set 12

AAGA     AAGC     AAGG     ACAG     ACCG     ACGA
ACAC     ACAA     ACAA     AACA     AAAA     AAAC
AGCG     AGCG     AGCC     AGGC     AGGC     AGCG
CAAG     CAAG     CAAC     CAAC     CACC     CACA
CCCA     CCCC     CCCG     CCGA     CCGA     CCAG
CGGC     CGGA     CGGA     CGCG     CGAG     CGGC
GACC     GACA     GACA     GAGG     GAGG     GAGG
GCGG     GCGG     GCGC     GCCC     GCAC     GCCC
GGAA     GGAC     GGAG     GGAA     GGCA     GGAA

The oligonucleotide tags of the invention and their complements are 
conveniently synthesized on an automated DNA synthesizer, e.g. an Applied 
Biosystems, Inc. (Foster City, California) model 392 or 394 DNA/RNA Synthesizer, 
using standard chemistries, such as phosphoramidite chemistry, e.g. disclosed in the 
following references: Beaucage and Iyer, Tetrahedron, 48: 2223-2311 (1992); Molko 
et al, U.S. patent 4,980,460; Koster et al, U.S. patent 4,725,677; Caruthers et al, U.S. 
patents 4,415,732; 4,458,066; and 4,973,679; and the like. Alternative chemistries, 
e.g. resulting in non-natural backbone groups, such as phosphorothioate, 
phosphoramidate, and the like, may also be employed provided that the resulting 
oligonucleotides are capable of specific hybridization. In some embodiments, tags 
may comprise naturally occurring nucleotides that permit processing or manipulation 
by enzymes, while the corresponding tag complements may comprise non-natural 
nucleotide analogs, such as peptide nucleic acids, or like compounds, that promote the 
formation of more stable duplexes during sorting. 

When microparticles are used as supports, repertoires of oligonucleotide tags 
and tag complements may be generated by subunit-wise synthesis via "split and mix" 
techniques, e.g. as disclosed in Shortle et al, International patent application
PCT/US93/03418 or Lyttle et al, Biotechniques, 19: 274-280 (1995). Briefly, the
basic unit of the synthesis is a subunit of the oligonucleotide tag. Preferably, 
phosphoramidite chemistry is used and 3' phosphoramidite oligonucleotides are 
prepared for each subunit in a minimally cross-hybridizing set, e.g. for the set first 
listed above, there would be eight 4-mer 3'-phosphoramidites. Synthesis proceeds as 
disclosed by Shortle et al or in direct analogy with the techniques employed to 
generate diverse oligonucleotide libraries using nucleosidic monomers, e.g. as 
disclosed in Telenius et al, Genomics, 13: 718-725 (1992); Welsh et al, Nucleic Acids 
Research, 19: 5275-5279 (1991); Grothues et al, Nucleic Acids Research, 21: 1321- 
1322 (1993); Hartley, European patent application 90304496.4; Lam et al, Nature,
354: 82-84 (1991); Zuckerman et al, Int. J. Pept. Protein Research, 40: 498-507 
(1992); and the like. Generally, these techniques simply call for the application of 
mixtures of the activated monomers to the growing oligonucleotide during the
coupling steps. Preferably, oligonucleotide tags and tag complements are synthesized
on a DNA synthesizer having a number of synthesis chambers which is greater than or
equal to the number of different kinds of words used in the construction of the tags.
That is, preferably there is a synthesis chamber corresponding to each type of word.
In this embodiment, words are added nucleotide-by-nucleotide, such that if a word
consists of five nucleotides there are five monomer couplings in each synthesis
chamber. After a word is completely synthesized, the synthesis supports are removed
from the chambers, mixed, and redistributed back to the chambers for the next cycle
of word addition. This latter embodiment takes advantage of the high coupling yields
of monomer addition, e.g. in phosphoramidite chemistries.
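
The mix-and-redistribute cycle described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `split_and_mix`, the bead count, and the round-robin redistribution are hypothetical choices made for illustration, each word is appended in one step rather than nucleotide-by-nucleotide, and the direction of chemical synthesis is ignored. The word set is the eight-member Set 1 listed earlier.

```python
import random

# Set 1 of the minimally cross-hybridizing 4-mer words listed earlier.
SET_1 = ["CATT", "CTAA", "TCAT", "ACTA", "TACA", "TTTC", "ATCT", "AAAC"]

def split_and_mix(words, n_words, n_beads, seed=0):
    """Toy model: one synthesis chamber per word, beads pooled between cycles."""
    rng = random.Random(seed)
    beads = [""] * n_beads                      # each bead starts with no tag
    for _ in range(n_words):
        rng.shuffle(beads)                      # mix the pooled supports
        # Redistribute: deal beads round-robin into one chamber per word.
        chambers = [beads[i::len(words)] for i in range(len(words))]
        # Each chamber adds its own word to every bead it holds.
        beads = [tag + word
                 for word, chamber in zip(words, chambers)
                 for tag in chamber]
    return beads

beads = split_and_mix(SET_1, n_words=3, n_beads=5000)
print(len(set(beads)), "distinct tags out of", len(SET_1) ** 3, "possible")
```

Every bead carries a single tag, and the number of distinct tags cannot exceed the set size raised to the number of words.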

Double stranded forms of tags may be made by separately synthesizing the 
complementary strands followed by mixing under conditions that permit duplex 
formation. Alternatively, double stranded tags may be formed by first synthesizing a
single stranded repertoire linked to a known oligonucleotide sequence that serves as a
primer binding site. The second strand is then synthesized by combining the single
stranded repertoire with a primer and extending with a polymerase. This latter
approach is described in Oliphant et al, Gene, 44: 177-183 (1986). Such duplex tags
may then be inserted into cloning vectors along with target polynucleotides for sorting
and manipulation of the target polynucleotide in accordance with the invention.

When tag complements are employed that are made up of nucleotides that 
have enhanced binding characteristics, such as PNAs or oligonucleotide N3'→P5'
phosphoramidates, sorting can be implemented through the formation of D-loops
between tags comprising natural nucleotides and their PNA or phosphoramidate
complements, as an alternative to the "stripping" reaction employing the 3'→5'
exonuclease activity of a DNA polymerase to render a tag single stranded.

Oligonucleotide tags of the invention may range in length from 12 to 60 
nucleotides or basepairs. Preferably, oligonucleotide tags range in length from 18 to
40 nucleotides or basepairs. More preferably, oligonucleotide tags range in length
from 25 to 40 nucleotides or basepairs. In terms of preferred and more preferred
numbers of subunits, these ranges may be expressed as follows:

Table IV

Numbers of Subunits in Tags in Preferred Embodiments

Monomers          Nucleotides in Oligonucleotide Tag
in Subunit     (12-60)          (18-40)          (25-40)

    3          4-20 subunits    6-13 subunits    8-13 subunits
    4          3-15 subunits    4-10 subunits    6-10 subunits
    5          2-12 subunits    3-8 subunits     5-8 subunits
    6          2-10 subunits    3-6 subunits     4-6 subunits

Most preferably, oligonucleotide tags are single stranded and specific hybridization 
occurs via Watson-Crick pairing with a tag complement. 
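
The entries of Table IV appear consistent with dividing each nucleotide range by the number of monomers per subunit and rounding down. The following sketch checks that reading; the function name `subunit_range` is hypothetical, and the rounding rule is inferred from the table rather than stated in the text.

```python
def subunit_range(min_nt, max_nt, monomers_per_subunit):
    """Subunit counts for a tag length range, rounding down (assumed rule)."""
    return min_nt // monomers_per_subunit, max_nt // monomers_per_subunit

for monomers in (3, 4, 5, 6):
    ranges = [subunit_range(lo, hi, monomers)
              for lo, hi in ((12, 60), (18, 40), (25, 40))]
    print(monomers, ranges)
```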

Preferably, repertoires of single stranded oligonucleotide tags of the invention 
contain at least 100 members; more preferably, repertoires of such tags contain at 
least 1000 members; and most preferably, repertoires of such tags contain at least 
10,000 members. 

Triplex Tags

In embodiments where specific hybridization occurs via triplex formation, 
coding of tag sequences follows the same principles as for duplex-forming tags; 
however, there are further constraints on the selection of subunit sequences. 
Generally, third strand association via Hoogsteen type of binding is most stable along 
homopyrimidine-homopurine tracks in a double stranded target. Usually, base triplets 
form in T-A*T or C-G*C motifs (where "-" indicates Watson-Crick pairing and "*" 
indicates Hoogsteen type of binding); however, other motifs are also possible. For 
example, Hoogsteen base pairing permits parallel and antiparallel orientations 
between the third strand (the Hoogsteen strand) and the purine-rich strand of the 
duplex to which the third strand binds, depending on conditions and the composition 
of the strands. There is extensive guidance in the literature for selecting appropriate 
sequences, orientation, conditions, nucleoside type (e.g. whether ribose or 
deoxyribose nucleosides are employed), base modifications (e.g. methylated cytosine,
and the like) in order to maximize, or otherwise regulate, triplex stability as desired in 
particular embodiments, e.g. Roberts et al, Proc. Natl. Acad. Sci., 88: 9397-9401 
(1991); Roberts et al, Science, 258: 1463-1466 (1992); Roberts et al, Proc. Natl. 
Acad. Sci., 93: 4320-4325 (1996); Distefano et al, Proc. Natl. Acad. Sci., 90: 1179-
1183 (1993); Mergny et al, Biochemistry, 30: 9791-9798 (1991); Cheng et al, J. Am.
Chem. Soc., 114: 4465-4474 (1992); Beal and Dervan, Nucleic Acids Research, 20:
2773-2776 (1992); Beal and Dervan, J. Am. Chem. Soc., 114: 4976-4982 (1992);
Giovannangeli et al, Proc. Natl. Acad. Sci., 89: 8631-8635 (1992); Moser and Dervan, 
Science, 238: 645-650 (1987); McShan et al, J. Biol. Chem., 267:5712-5721 (1992); 
Yoon et al, Proc. Natl. Acad. Sci., 89: 3840-3844 (1992); Blume et al, Nucleic Acids 
Research, 20: 1777-1784 (1992); Thuong and Helene, Angew. Chem. Int. Ed. Engl. 
32: 666-690 (1993); Escude et al, Proc. Natl. Acad. Sci., 93: 4365-4369 (1996); and
the like. Conditions for annealing single-stranded or duplex tags to their
single-stranded or duplex complements are well known, e.g. Ji et al, Anal. Chem., 65: 1323-
1328 (1993); Cantor et al, U.S. patent 5,482,836; and the like. Use of triplex tags has
the advantage of not requiring a "stripping" reaction with polymerase to expose the
tag for annealing to its complement.

Preferably, oligonucleotide tags of the invention employing triplex 
hybridization are double stranded DNA and the corresponding tag complements are 
single stranded. More preferably, 5-methylcytosine is used in place of cytosine in the
tag complements in order to broaden the range of pH stability of the triplex formed
between a tag and its complement. Preferred conditions for forming triplexes are
fully disclosed in the above references. Briefly, hybridization takes place in
concentrated salt solution, e.g. 1.0 M NaCl, 1.0 M potassium acetate, or the like, at
pH below 5.5 (or 6.5 if 5-methylcytosine is employed). Hybridization temperature
depends on the length and composition of the tag; however, for an 18-20-mer tag or
longer, hybridization at room temperature is adequate. Washes may be conducted
with less concentrated salt solutions, e.g. 10 mM sodium acetate, 100 mM MgCl2, pH
5.8, at room temperature. Tags may be eluted from their tag complements by
incubation in a similar salt solution at pH 9.0.

Minimally cross-hybridizing sets of oligonucleotide tags that form triplexes
may be generated by the computer program of Appendix Ic, or similar programs. An
exemplary set of double stranded 8-mer words is listed below in capital letters with
the corresponding complements in small letters. Each such word differs from each of
the other words in the set by at least three base pairs.

Table V

Exemplary Minimally Cross-Hybridizing
Set of Double Stranded 8-mer Tags

5'-AAGGAGAG    5'-AAAGGGGA    5'-AGAGAAGA    5'-AGGGGGGG
3'-TTCCTCTC    3'-TTTCCCCT    3'-TCTCTTCT    3'-TCCCCCCC
3'-ttcctctc    3'-tttcccct    3'-tctcttct    3'-tccccccc

5'-AAAAAAAA    5'-AAGAGAGA    5'-AGGAAAAG    5'-GAAAGGAG
3'-TTTTTTTT    3'-TTCTCTCT    3'-TCCTTTTC    3'-CTTTCCTC
3'-tttttttt    3'-ttctctct    3'-tccttttc    3'-ctttcctc

5'-AAAAAGGG    5'-AGAAGAGG    5'-AGGAAGGA    5'-GAAGAAGG
3'-TTTTTCCC    3'-TCTTCTCC    3'-TCCTTCCT    3'-CTTCTTCC
3'-tttttccc    3'-tcttctcc    3'-tccttcct    3'-cttcttcc

5'-AAAGGAAG    5'-AGAAGGAA    5'-AGGGGAAA    5'-GAAGAGAA
3'-TTTCCTTC    3'-TCTTCCTT    3'-TCCCCTTT    3'-CTTCTCTT
3'-tttccttc    3'-tcttcctt    3'-tccccttt    3'-cttctctt
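
The minimal cross-hybridization property of the sixteen 8-mer words in Table V can be checked by counting, for every pair of words, the positions at which they differ. The sketch below does this with a plain Hamming distance; it is not the program of Appendix Ic, and the function names are hypothetical.

```python
from itertools import combinations

# The sixteen 5'->3' purine-strand words of Table V.
WORDS = [
    "AAGGAGAG", "AAAGGGGA", "AGAGAAGA", "AGGGGGGG",
    "AAAAAAAA", "AAGAGAGA", "AGGAAAAG", "GAAAGGAG",
    "AAAAAGGG", "AGAAGAGG", "AGGAAGGA", "GAAGAAGG",
    "AAAGGAAG", "AGAAGGAA", "AGGGGAAA", "GAAGAGAA",
]
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_distance(words):
    """Smallest number of differences between any two words in the set."""
    return min(hamming(a, b) for a, b in combinations(words, 2))

def watson_crick_partner(word):
    """Base-by-base complement, written 3'->5' under the word as in Table V."""
    return word.translate(COMPLEMENT)

print(min_pairwise_distance(WORDS))
print(watson_crick_partner("AAGGAGAG"))
```

Some pairs differ at more than three positions, but no pair differs at fewer than three.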

Table VI

Repertoire Size of Various Double Stranded Tags
That Form Triplexes with Their Tag Complements

Oligonucleotide  Nucleotide Difference     Maximal Size of   Size of          Size of
Word Length      between Oligonucleotides  Minimally Cross-  Repertoire with  Repertoire with
                 of Minimally Cross-       Hybridizing Set   Four Words       Five Words
                 Hybridizing Set

       4                   2                      8          4096             3.2 x 10^4
       6                   3                      8          4096             3.2 x 10^4
       8                   3                     16          6.5 x 10^4       1.05 x 10^6
      10                   5                      8          4096
      15                   5                     92
      20                   6                    765
      20                   8                     92
      20                  10                     22


Preferably, repertoires of double stranded oligonucleotide tags of the invention
contain at least 10 members; more preferably, repertoires of such tags contain at least
100 members. Preferably, words are between 4 and 8 nucleotides in length for
combinatorially synthesized double stranded oligonucleotide tags, and oligonucleotide
tags are between 12 and 60 base pairs in length. More preferably, such tags are
between 18 and 40 base pairs in length.
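
The repertoire sizes quoted for combinatorially synthesized tags follow from raising the size of the word set to the number of words per tag. A minimal sketch of that arithmetic, with hypothetical function names:

```python
def repertoire_size(set_size, n_words):
    """Number of distinct tags built from n_words words drawn from one set."""
    return set_size ** n_words

def words_needed(set_size, min_members):
    """Smallest number of words per tag giving at least min_members tags."""
    n_words = 1
    while repertoire_size(set_size, n_words) < min_members:
        n_words += 1
    return n_words

print(repertoire_size(8, 4), repertoire_size(8, 5))
print(repertoire_size(16, 4), repertoire_size(16, 5))
print(words_needed(8, 10_000))
```

With a set of eight words, for example, five words per tag are the fewest that give a repertoire of at least 10,000 members.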

Solid Phase Supports 
Solid phase supports for use with the invention may have a wide variety of
forms, including microparticles, beads, membranes, slides, plates, micromachined
chips, and the like. Likewise, solid phase supports of the invention may comprise a
wide variety of compositions, including glass, plastic, silicon,
alkanethiolate-derivatized gold, cellulose, low cross-linked and high cross-linked
polystyrene, silica gel, polyamide, and the like. Preferably, either a population of
discrete particles is employed such that each has a uniform coating, or population, of
complementary sequences of the same tag (and no other), or a single or a few supports
are employed with spatially discrete regions each containing a uniform coating, or
population, of complementary sequences to the same tag (and no other). In the latter
embodiment, the area of the regions may vary according to particular applications;
usually, the regions range in area from several µm², e.g. 3-5, to several hundred µm²,
e.g. 100-500. Preferably, such regions are spatially discrete so that signals generated by
events, e.g. fluorescent emissions, at adjacent regions can be resolved by the detection
system being employed. In some applications, it may be desirable to have regions
with uniform coatings of more than one tag complement, e.g. for simultaneous
sequence analysis, or for bringing separately tagged molecules into close proximity.

Tag complements may be used with the solid phase support that they are
synthesized on, or they may be separately synthesized and attached to a solid phase
support for use, e.g. as disclosed by Lund et al, Nucleic Acids Research, 16: 10861-
10880 (1988); Albretsen et al, Anal. Biochem., 189: 40-50 (1990); Wolf et al, Nucleic
Acids Research, 15: 2911-2926 (1987); or Ghosh et al, Nucleic Acids Research, 15:
5353-5372 (1987). Preferably, tag complements are synthesized on and used with the
same solid phase support, which may comprise a variety of forms and include a
variety of linking moieties. Such supports may comprise microparticles or arrays, or
matrices, of regions where uniform populations of tag complements are synthesized.
A wide variety of microparticle supports may be used with the invention, including
microparticles made of controlled pore glass (CPG), highly cross-linked polystyrene,
acrylic copolymers, cellulose, nylon, dextran, latex, polyacrolein, and the like,
disclosed in the following exemplary references: Meth. Enzymol., Section A, pages
11-147, vol. 44 (Academic Press, New York, 1976); U.S. patents 4,678,814;
4,413,070; and 4,046,720; and Pon, Chapter 19, in Agrawal, editor, Methods in
Molecular Biology, Vol. 20, (Humana Press, Totowa, NJ, 1993). Microparticle
supports further include commercially available nucleoside-derivatized CPG and
polystyrene beads (e.g. available from Applied Biosystems, Foster City, CA);
derivatized magnetic beads; polystyrene grafted with polyethylene glycol (e.g.,
TentaGel™, Rapp Polymere, Tubingen, Germany); and the like. Selection of the
support characteristics, such as material, porosity, size, shape, and the like, and the
type of linking moiety employed depends on the conditions under which the tags are
used. For example, in applications involving successive processing with enzymes,
supports and linkers that minimize steric hindrance of the enzymes and that facilitate
access to substrate are preferred. Other important factors to be considered in selecting
the most appropriate microparticle support include size uniformity, efficiency as a
synthesis support, degree to which surface area is known, and optical properties, e.g. as
explained more fully below, clear smooth beads provide instrumentational advantages
when handling large numbers of beads on a surface.

Exemplary linking moieties for attaching and/or synthesizing tags on 
microparticle surfaces are disclosed in Pon et al, Biotechniques, 6:768-775 (1988); 
Webb, U.S. patent 4,659,774; Barany et al, International patent application 
PCT/US91/06103; Brown et al, J. Chem. Soc. Commun., 1989: 891-893; Damha et
al, Nucleic Acids Research, 18: 3813-3821 (1990); Beattie et al, Clinical Chemistry,
39: 719-722 (1993); Maskos and Southern, Nucleic Acids Research, 20: 1679-1684
(1992); and the like. 

As mentioned above, tag complements may also be synthesized on a single 
(or a few) solid phase support to form an array of regions uniformly coated with tag
complements. That is, within each region in such an array the same tag complement
is synthesized. Techniques for synthesizing such arrays are disclosed in McGall et al,
International application PCT/US93/03767; Pease et al, Proc. Natl. Acad. Sci., 91:
5022-5026 (1994); Southern and Maskos, International application
PCT/GB89/01114; Maskos and Southern (cited above); Southern et al, Genomics, 13:
1008-1017 (1992); and Maskos and Southern, Nucleic Acids Research, 21: 4663-
4669 (1993).

Preferably, the invention is implemented with microparticles or beads 
uniformly coated with complements of the same tag sequence. Microparticle supports 
and methods of covalently or noncovalently linking oligonucleotides to their surfaces
are well known, as exemplified by the following references: Beaucage and Iyer (cited
above); Gait, editor, Oligonucleotide Synthesis: A Practical Approach (IRL Press,
Oxford, 1984); and the references cited above. Generally, the size and shape of a
microparticle are not critical; however, microparticles in the size range of a few, e.g. 1-
2, to several hundred, e.g. 200-1000, µm in diameter are preferable, as they facilitate the
construction and manipulation of large repertoires of oligonucleotide tags with
minimal reagent and sample usage.

In some preferred applications, commercially available controlled-pore glass
(CPG) or polystyrene supports are employed as solid phase supports in the invention.
Such supports come available with base-labile linkers and initial nucleosides attached,
e.g. Applied Biosystems (Foster City, CA). Preferably, microparticles having pore
size between 500 and 1000 angstroms are employed.

In other preferred applications, non-porous microparticles are employed for
their optical properties, which may be advantageously used when tracking large
numbers of microparticles on planar supports, such as a microscope slide.
Particularly preferred non-porous microparticles are the glycidyl methacrylate (GMA)
beads available from Bangs Laboratories (Carmel, IN). Such microparticles are
useful in a variety of sizes and derivatized with a variety of linkage groups for
synthesizing tags or tag complements. Preferably, for massively parallel
manipulations of tagged microparticles, 5 µm diameter GMA beads are employed.



Attaching Tags to Polynucleotides 
For Sorting onto Solid Phase Supports 
An important aspect of the invention is the sorting and attachment of a
population of polynucleotides, e.g. from a cDNA library, to microparticles or to
separate regions on a solid phase support such that each microparticle or region has
substantially only one kind of polynucleotide attached. This objective is
accomplished by ensuring that substantially all different polynucleotides have
different tags attached. This condition, in turn, is brought about by taking a sample of
the full ensemble of tag-polynucleotide conjugates for analysis. (It is acceptable that
identical polynucleotides have different tags, as it merely results in the same
polynucleotide being operated on or analyzed twice in two different locations.) Such
sampling can be carried out overtly (for example, by taking a small volume
from a larger mixture) after the tags have been attached to the polynucleotides; it can
be carried out inherently as a secondary effect of the techniques used to process the
polynucleotides and tags; or it can be carried out both overtly and as an
inherent part of processing steps.

Preferably, in constructing a cDNA library where substantially all different
cDNAs have different tags, a tag repertoire is employed whose complexity, or number
of distinct tags, greatly exceeds the total number of mRNAs extracted from a cell or
tissue sample. Preferably, the complexity of the tag repertoire is at least 10 times that
of the polynucleotide population; and more preferably, the complexity of the tag
repertoire is at least 100 times that of the polynucleotide population. Below, a
protocol is disclosed for cDNA library construction using a primer mixture that
contains a full repertoire of exemplary 9-word tags. Such a mixture of tag-containing
primers has a complexity of 8^9, or about 1.34 x 10^8. As indicated by Winslow et al,
Nucleic Acids Research, 19: 3251-3253 (1991), mRNA for library construction can
be extracted from as few as 10-100 mammalian cells. Since a single mammalian cell
contains about 5 x 10^5 copies of mRNA molecules of about 3.4 x 10^4 different kinds,
by standard techniques one can isolate the mRNA from about 100 cells, or
(theoretically) about 5 x 10^7 mRNA molecules. Comparing this number to the
complexity of the primer mixture shows that without any additional steps, and even
assuming that mRNAs are converted into cDNAs with perfect efficiency (1%
efficiency or less is more accurate), the cDNA library construction protocol results in
a population containing no more than 37% of the total number of different tags. That
is, without any overt sampling step at all, the protocol inherently generates a sample
that comprises 37%, or less, of the tag repertoire. The probability of obtaining a
double under these conditions is about 5%, which is within the preferred range. With
mRNA from 10 cells, the fraction of the tag repertoire sampled is reduced to only
3.7%, even assuming that all the processing steps take place at 100% efficiency. In
fact, the efficiencies of the processing steps for constructing cDNA libraries are very
low, a "rule of thumb" being that a good library should contain about 10^8 cDNA clones
from mRNA extracted from 10^6 mammalian cells.
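
The 37% and 3.7% figures above can be reproduced directly from the stated quantities: a repertoire of 8^9 tags and about 5 x 10^5 mRNA molecules per cell. The sketch below is only that arithmetic; the names and the `efficiency` parameter are hypothetical.

```python
TAG_REPERTOIRE = 8 ** 9        # nine words drawn from a set of eight
MRNA_PER_CELL = 5e5            # copies of mRNA per mammalian cell (from the text)

def fraction_of_repertoire(n_cells, efficiency=1.0):
    """Upper bound on the fraction of tags used if each mRNA takes a new tag."""
    return n_cells * MRNA_PER_CELL * efficiency / TAG_REPERTOIRE

print(TAG_REPERTOIRE)
print(round(fraction_of_repertoire(100), 3))
print(round(fraction_of_repertoire(10), 4))
```

Passing a realistic `efficiency` (for example 0.01) lowers the sampled fraction proportionally.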
Where larger amounts of mRNA are used in the above protocol, or larger amounts
of polynucleotides in general, such that the number of such molecules exceeds the
complexity of the tag repertoire, a tag-polynucleotide conjugate mixture potentially
contains every possible pairing of tags and types of mRNA or polynucleotide. In such
cases, overt sampling may be implemented by removing a sample volume after a
serial dilution of the starting mixture of tag-polynucleotide conjugates. The amount
of dilution required depends on the amount of starting material and the efficiencies of
the processing steps, which are readily estimated.

If mRNA were extracted from 10^6 cells (which would correspond to about 0.5
µg of poly(A)+ RNA), and if primers were present in about 10-100 fold concentration
excess, as is called for in a typical protocol, e.g. Sambrook et al, Molecular Cloning,
Second Edition, page 8.61 [10 µL of 1.8 kb mRNA at 1 mg/mL equals about 1.68 x 10^-11
moles and 10 µL of 18-mer primer at 1 mg/mL equals about 1.68 x 10^-9 moles], then the
total number of tag-polynucleotide conjugates in a cDNA library would simply be
equal to or less than the starting number of mRNAs, or about 5 x 10^11 vectors
containing tag-polynucleotide conjugates. Again, this assumes that each step in cDNA
construction (first strand synthesis, second strand synthesis, ligation into a vector)
occurs with perfect efficiency, which is a very conservative estimate. The actual
number is significantly less.
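
The bracketed mole quantities can be checked with a short calculation. The sketch assumes an average mass of 330 g/mol per nucleotide, a common rule of thumb that is not stated in the text but reproduces the quoted figures; the function name is hypothetical.

```python
AVG_NT_MASS = 330.0   # g/mol per nucleotide; an assumed rule-of-thumb value

def moles_of_strand(volume_ul, conc_mg_per_ml, length_nt):
    """Moles of a single-stranded nucleic acid in the given volume."""
    grams = volume_ul * conc_mg_per_ml * 1e-6   # 1 mg/mL is 1 µg/µL
    return grams / (length_nt * AVG_NT_MASS)

mrna_moles = moles_of_strand(10, 1.0, 1800)     # 10 µL of 1.8 kb mRNA
primer_moles = moles_of_strand(10, 1.0, 18)     # 10 µL of 18-mer primer
print(f"{mrna_moles:.2e} mol mRNA, {primer_moles:.2e} mol primer")
print(f"primer excess: {primer_moles / mrna_moles:.0f}-fold")

starting_mrnas = 1e6 * 5e5                      # 10^6 cells x 5 x 10^5 mRNAs each
print(f"{starting_mrnas:.0e} starting mRNA molecules")
```

At equal mass concentration the molar excess of primer equals the length ratio, here 100-fold, which sits at the top of the 10-100 fold range quoted.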

If a sample of n tag-polynucleotide conjugates is randomly drawn from a
reaction mixture, as could be effected by taking a sample volume, the probability of
drawing conjugates having the same tag is described by the Poisson distribution,
$P(r) = e^{-\lambda}\lambda^{r}/r!$, where r is the number of conjugates having the same tag
and $\lambda = np$, where p is the probability of a given tag being selected. If
$n = 10^{6}$ and $p = 1/(1.34 \times 10^{8})$, then $\lambda = 0.00746$ and
$P(2) = 2.76 \times 10^{-5}$. Thus, a sample of one million molecules
gives rise to an expected number of doubles well within the preferred range. Such a
sample is readily obtained as follows: Assume that the 5 x 10^11 mRNAs are perfectly
converted into 5 x 10^11 vectors with tag-cDNA conjugates as inserts and that the 5 x
10^11 vectors are in a reaction solution having a volume of 100 µL. Four 10-fold serial
dilutions may be carried out by transferring 10 µL from the original solution into a
vessel containing 90 µL of an appropriate buffer, such as TE. This process may be
repeated for three additional dilutions to obtain a 100 µL solution containing 5 x 10^5
vector molecules per µL. A 2 µL aliquot from this solution yields 10^6 vectors
containing tag-cDNA conjugates as inserts. This sample is then amplified by
straightforward transformation of a competent host cell followed by culturing.
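
The Poisson figures quoted above can be reproduced numerically. This is a minimal sketch using only the values given in the text; the function name `poisson` is hypothetical.

```python
import math

def poisson(r, lam):
    """Poisson probability of exactly r occurrences when the mean is lam."""
    return math.exp(-lam) * lam ** r / math.factorial(r)

n = 10 ** 6            # conjugates sampled
p = 1 / 1.34e8         # probability of a given tag being selected
lam = n * p

print(f"lambda = {lam:.5f}")
print(f"P(2)   = {poisson(2, lam):.3g}")
```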

Of course, as mentioned above, no step in the above process proceeds with 
perfect efficiency. In particular, when vectors are employed to amplify a sample of 
tag-polynucleotide conjugates, the step of transforming a host is very inefficient. 

Usually, no more than 1% of the vectors are taken up by the host and replicated.
Thus, for such a method of amplification, even fewer dilutions would be required to
obtain a sample of 10^6 conjugates.
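
The dilution arithmetic, and the effect of a low transformation efficiency on the number of dilutions needed, can be sketched as follows. The function name is hypothetical, and the 1% uptake figure is the upper bound quoted above, used here as if it were exact.

```python
import math

def concentration_after(conc_per_ul, n_dilutions, factor=10):
    """Molecules per µL after a series of equal-fold dilutions."""
    return conc_per_ul / factor ** n_dilutions

start = 5e11 / 100                 # 5 x 10^11 vectors in 100 µL

# Perfect uptake: four 10-fold dilutions, then a 2 µL aliquot.
vectors_sampled = 2 * concentration_after(start, 4)
print(f"{vectors_sampled:.0e} vectors in 2 µL after four dilutions")

# Assuming 1% uptake, two dilutions already give about 10^6 transformants.
transformants = 0.01 * 2 * concentration_after(start, 2)
print(f"{transformants:.0e} transformants from 2 µL after two dilutions")
```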

A repertoire of oligonucleotide tags can be conjugated to a population of
polynucleotides in a number of ways, including direct enzymatic ligation,
amplification, e.g. via PCR, using primers containing the tag sequences, and the like.
The initial ligating step produces a very large population of tag-polynucleotide
conjugates such that a single tag is generally attached to many different
polynucleotides. However, as noted above, by taking a sufficiently small sample of
the conjugates, the probability of obtaining "doubles," i.e. the same tag on two
different polynucleotides, can be made negligible. Generally, the larger the sample
the greater the probability of obtaining a double. Thus, a design trade-off exists
between selecting a large sample of tag-polynucleotide conjugates, which, for
example, ensures adequate coverage of a target polynucleotide in a shotgun
sequencing operation or adequate representation of a rapidly changing mRNA pool,
and selecting a small sample, which ensures that a minimal number of doubles will be
present. In most embodiments, the presence of doubles merely adds an additional
source of noise or, in the case of sequencing, a minor complication in scanning and
signal processing, as microparticles giving multiple fluorescent signals can simply be
ignored.

As used herein, the term "substantially all" in reference to attaching tags to
molecules, especially polynucleotides, is meant to reflect the statistical nature of the
sampling procedure employed to obtain a population of tag-molecule conjugates
essentially free of doubles. The meaning of substantially all in terms of actual
percentages of tag-molecule conjugates depends on how the tags are being employed.
Preferably, for nucleic acid sequencing, substantially all means that at least eighty
percent of the polynucleotides have unique tags attached. More preferably, it means
that at least ninety percent of the polynucleotides have unique tags attached. Still
more preferably, it means that at least ninety-five percent of the polynucleotides have
unique tags attached. And, most preferably, it means that at least ninety-nine percent
of the polynucleotides have unique tags attached.
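
The trade-off between sample size and the percentage of polynucleotides carrying unique tags can be estimated with a simple model. The sketch assumes that every conjugate in the sample carries a tag drawn independently and uniformly from the repertoire, which is an idealization not stated in the text; the function names are hypothetical.

```python
import math

def expected_unique_fraction(n_sampled, repertoire):
    """Chance that none of the other n - 1 conjugates shares a given tag."""
    return (1 - 1 / repertoire) ** (n_sampled - 1)

def max_sample_size(target_fraction, repertoire):
    """Largest sample whose expected unique fraction still meets the target."""
    return int(1 + math.log(target_fraction) / math.log1p(-1 / repertoire))

REPERTOIRE = 8 ** 9
print(f"{expected_unique_fraction(10 ** 6, REPERTOIRE):.4f}")
for target in (0.80, 0.90, 0.95, 0.99):
    print(target, max_sample_size(target, REPERTOIRE))
```

Under this model a sample of one million conjugates from the 8^9 repertoire meets the most preferred ninety-nine percent threshold.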

Preferably, when the population of polynucleotides consists of messenger
RNA (mRNA), oligonucleotide tags may be attached by reverse transcribing the
mRNA with a set of primers preferably containing complements of tag sequences.
An exemplary set of such primers could have the following sequence (SEQ ID NO:
1):

5'-mRNA-[A]n-3'
        [T]19GG[W,W,W,C]9ACCAGCTGATC-5'-biotin

where "[W,W,W,C]9" represents the sequence of an oligonucleotide tag of nine
subunits of four nucleotides each and "[W,W,W,C]" represents the subunit sequences
listed above, i.e. "W" represents T or A. The underlined sequences identify an
optional restriction endonuclease site that can be used to release the polynucleotide
from attachment to a solid phase support via the biotin, if one is employed. For the
above primer, the complement attached to a microparticle could have the form:

5'-[G,W,W,W]9TGG-linker-microparticle

After reverse transcription, the mRNA is removed, e.g. by RNase H digestion,
and the second strand of the cDNA is synthesized using, for example, a primer of the
following form (SEQ ID NO: 2):

5'-NRRGATCYNNN-3'

where N is any one of A, T, G, or C; R is a purine-containing nucleotide, and Y is a
pyrimidine-containing nucleotide. This particular primer creates a Bst YI restriction
site in the resulting double stranded DNA which, together with the Sal I site,
facilitates cloning into a vector with, for example, Bam HI and Xho I sites. After Bst
YI and Sal I digestion, the exemplary conjugate would have the form:

5'-RCGACCA[C,W,W,W]9GG[T]19- cDNA -NNNR
       GGT[G,W,W,W]9CC[A]19- rDNA -NNNYCTAG-5'
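
The degenerate second-strand primer 5'-NRRGATCYNNN-3' given above stands for a mixture of sequences. The following sketch enumerates that mixture using the standard single-letter degeneracy codes and confirms that every member carries a Bst YI recognition sequence (RGATCY); the function name `expand` is hypothetical.

```python
from itertools import product

# Single-letter degeneracy codes used in the primer definitions.
CODES = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "W": "AT", "N": "ACGT"}

def expand(degenerate):
    """All concrete sequences represented by a degenerate sequence."""
    return ["".join(bases) for bases in product(*(CODES[c] for c in degenerate))]

primers = expand("NRRGATCYNNN")
bst_yi_sites = set(expand("RGATCY"))

print(len(primers))
print(all(primer[2:8] in bst_yi_sites for primer in primers))
```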

The polynucleotide-tag conjugates may then be manipulated using standard molecular
biology techniques. For example, the above conjugate (which is actually a mixture)
may be inserted into commercially available cloning vectors, e.g. Stratagene Cloning
System (La Jolla, CA), and transfected into a host, such as a commercially available
host bacterium, which is then cultured to increase the number of conjugates. The cloning
vectors may then be isolated using standard techniques, e.g. Sambrook et al,
Molecular Cloning, Second Edition (Cold Spring Harbor Laboratory, New York,
1989). Alternatively, appropriate adaptors and primers may be employed so that the
conjugate population can be increased by PCR.

Preferably, when the ligase-based method of sequencing is employed, the Bst
YI and Sal I digested fragments are cloned into a Bam HI-/Xho I-digested vector
having the following single-copy restriction sites (SEQ ID NO: 3):

5'-GAGGATGCCTTTATGGATCCACTCGAGATCCCAATCCA-3'
     Fok I       Bam HI Xho I

This adds the Fok I site which will allow initiation of the sequencing process
discussed more fully below.
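
The positions of the three recognition sequences within SEQ ID NO: 3 can be located with a simple string search over both strands. This sketch covers only the 38-nucleotide segment shown, not the rest of the vector; the function names are hypothetical.

```python
POLYLINKER = "GAGGATGCCTTTATGGATCCACTCGAGATCCCAATCCA"
SITES = {"Fok I": "GGATG", "Bam HI": "GGATCC", "Xho I": "CTCGAG"}

def reverse_complement(seq):
    """Sequence of the opposite strand, read 5'->3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def site_positions(seq, site):
    """0-based start positions of a site on either strand of seq."""
    hits = set()
    for motif in {site, reverse_complement(site)}:
        start = seq.find(motif)
        while start != -1:
            hits.add(start)
            start = seq.find(motif, start + 1)
    return sorted(hits)

for name, site in SITES.items():
    print(name, site_positions(POLYLINKER, site))
```

Each site occurs exactly once in this segment, consistent with the single-copy description.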

Tags can be conjugated to cDNAs of existing libraries by standard cloning
methods. cDNAs are excised from their existing vector, isolated, and then ligated into
a vector containing a repertoire of tags. Preferably, the tag-containing vector is
linearized by cleaving with two restriction enzymes so that the excised cDNAs can be
ligated in a predetermined orientation. The concentration of the linearized
tag-containing vector is in substantial excess over that of the cDNA inserts so that
ligation provides an inherent sampling of tags.

A general method for exposing the single stranded tag after amplification
involves digesting a target polynucleotide-containing conjugate with the 3'→5'
exonuclease activity of T4 DNA polymerase, or a like enzyme. When used in the
presence of a single deoxynucleoside triphosphate, such a polymerase will cleave
nucleotides from 3' recessed ends present on the non-template strand of a double
stranded fragment until a complement of the single deoxynucleoside triphosphate is
reached on the template strand. When such a nucleotide is reached the 3'→5'
digestion effectively ceases, as the polymerase's extension activity adds nucleotides at
a higher rate than the excision activity removes nucleotides. Consequently, single
stranded tags constructed with three nucleotides are readily prepared for loading onto
solid phase supports.
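
The stopping rule of the stripping reaction can be illustrated with a toy model. The sketch treats the digested strand as a string read 5'→3' and removes bases from its 3' end until the terminal base is the one supplied as triphosphate, at which point re-addition is taken to balance excision. It ignores the template strand, kinetics, and enzyme specifics, and the example sequence is hypothetical: an arbitrary G-containing stretch followed by three Set 1 words that contain no G.

```python
def strip_to_stop(strand_5to3, dntp):
    """Split a strand into the part retained and the 3' part removed.

    Bases are removed from the 3' end until the terminal base equals the
    single supplied nucleotide (toy model of the stripping reaction).
    """
    end = len(strand_5to3)
    while end > 0 and strand_5to3[end - 1] != dntp:
        end -= 1
    return strand_5to3[:end], strand_5to3[end:]

retained, removed = strip_to_stop("ACGGTG" + "CATT" + "CTAA" + "TCAT", "G")
print(retained, removed)
```

Because the three words use only three of the four nucleotides, digestion runs through all of them and halts at the first G, leaving the opposite strand single stranded over the whole tag.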

The technique may also be used to preferentially methylate interior Fok I sites
of a target polynucleotide while leaving a single Fok I site at the terminus of the
polynucleotide unmethylated. First, the terminal Fok I site is rendered single stranded
using a polymerase with deoxycytidine triphosphate. The double stranded portion of
the fragment is then methylated, after which the single stranded terminus is filled in
with a DNA polymerase in the presence of all four nucleoside triphosphates, thereby
regenerating the Fok I site. Clearly, this procedure can be generalized to
endonucleases other than Fok I.

After the oligonucleotide tags are prepared for specific hybridization, e.g. by 
rendering them single stranded as described above, the polynucleotides are mixed 
with microparticles containing the complementary sequences of the tags under 
conditions that favor the formation of perfectly matched duplexes between the tags
and their complements. There is extensive guidance in the literature for creating these
conditions. Exemplary references providing such guidance include Wetmur, Critical 
Reviews in Biochemistry and Molecular Biology, 26: 227-259 (1991); Sambrook et 
al, Molecular Cloning: A Laboratory Manual, 2nd Edition (Cold Spring Harbor 
Laboratory, New York, 1989); and the like. Preferably, the hybridization conditions
are sufficiently stringent so that only perfectly matched sequences form stable
duplexes. Under such conditions the polynucleotides specifically hybridized through
their tags may be ligated to the complementary sequences attached to the 
microparticles. Finally, the microparticles are washed to remove polynucleotides with 
unligated and/or mismatched tags. 

When CPG microparticles conventionally employed as synthesis supports are
used, the density of tag complements on the microparticle surface is typically greater
than that necessary for some sequencing operations. That is, in sequencing 
approaches that require successive treatment of the attached polynucleotides with a 
variety of enzymes, densely spaced polynucleotides may tend to inhibit access of the
relatively bulky enzymes to the polynucleotides. In such cases, the polynucleotides
are preferably mixed with the microparticles so that tag complements are present in 
significant excess, e.g. from 10:1 to 100:1, or greater, over the polynucleotides. This 
ensures that the density of polynucleotides on the microparticle surface will not be so 
high as to inhibit enzyme access. Preferably, the average inter-polynucleotide spacing
on the microparticle surface is on the order of 30-100 nm. Guidance in selecting
ratios for standard CPG supports and Ballotini beads (a type of solid glass support) is
found in Maskos and Southern, Nucleic Acids Research, 20: 1679-1684 (1992).
Preferably, for sequencing applications, standard CPG beads of diameter in the range
of 20-50 µm are loaded with about 10^5 polynucleotides, and GMA beads of diameter
in the range of 5-10 µm are loaded with a few tens of thousands of polynucleotides,
e.g. 4 x 10^4 to 6 x 10^4.

In the preferred embodiment, tag complements are synthesized on
microparticles combinatorially; thus, at the end of the synthesis, one obtains a
complex mixture of microparticles from which a sample is taken for loading tagged
polynucleotides. The size of the sample of microparticles will depend on several
factors, including the size of the repertoire of tag complements, the nature of the
apparatus used for observing loaded microparticles (e.g. its capacity), the tolerance
for multiple copies of microparticles with the same tag complement (i.e. "bead
doubles"), and the like. The following table provides guidance regarding
microparticle sample size, microparticle diameter, and the approximate physical
dimensions of a packed array of microparticles of various diameters.



Microparticle diameter         5 µm            10 µm          20 µm          40 µm

Max. no. polynucleotides
loaded at 1 per 10^5 sq.
angstrom                                       3 x 10^5       1.26 x 10^6    5 x 10^6

Approx. area of monolayer
of 10^6 microparticles         .45 x .45 cm    1 x 1 cm       2 x 2 cm       4 x 4 cm
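
The entries in the table above can be approximated from simple geometry. The following is a minimal sketch, not part of the original disclosure. It assumes a smooth spherical microparticle whose whole surface is available for loading, and beads laid out on a square grid one diameter apart. The function names are hypothetical.

```python
import math

# 1 micrometre = 1e4 angstrom, so 1 square micrometre = 1e8 square angstrom.
ANGSTROM2_PER_UM2 = 1e8


def max_polynucleotides(diameter_um, area_per_molecule_a2=1e5):
    """Upper bound on molecules carried by a sphere of the given diameter,
    assuming one molecule per area_per_molecule_a2 square angstroms."""
    surface_um2 = math.pi * diameter_um ** 2  # surface area of a sphere
    return surface_um2 * ANGSTROM2_PER_UM2 / area_per_molecule_a2


def monolayer_side_cm(diameter_um, n_beads=1_000_000):
    """Side length (cm) of a square monolayer of n_beads on a square grid
    (an assumed packing; closer packing gives a slightly smaller area)."""
    beads_per_side = math.sqrt(n_beads)
    return beads_per_side * diameter_um * 1e-4  # micrometres to centimetres


for d in (5, 10, 20, 40):
    print(d, f"{max_polynucleotides(d):.2e}", f"{monolayer_side_cm(d):.2f}")
```

Under these assumptions the 10, 20, and 40 µm beads give loading limits close to the tabulated values, and the square-grid estimate for 5 µm beads is slightly larger than the tabulated .45 x .45 cm.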



The probability that the sample of microparticles contains a given tag complement, or
that a tag complement is present in multiple copies, is described by the Poisson
distribution, as indicated in the following table.

                                      Table VII

Number of              Fraction of            Fraction of            Fraction of
microparticles in      repertoire of tag      microparticles in      microparticles in
sample (as fraction    complements            sample with unique     sample carrying same
of repertoire size),   present in sample,     tag complement         tag complement as one
m                      1 - e^-m               attached,              other microparticle
                                              m(e^-m)                in sample ("bead
                                                                     doubles"),
                                                                     m^2(e^-m)/2

1.000                  0.63                   0.37                   0.18
.693                   0.50                   0.35                   0.12
.405                   0.33                   0.27                   0.05
.285                   0.25                   0.21                   0.03
.223                   0.20                   0.18                   0.02
.105                   0.10                   0.09                   0.005
.010                   0.01                   0.01
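
The Poisson expressions in the column headings of Table VII can be evaluated directly. The sketch below is illustrative only. It assumes the number of microparticles carrying any one tag complement is Poisson distributed with mean m, and the function name is hypothetical.

```python
import math


def poisson_fractions(m):
    """Return the three Table VII quantities for a sample of size m,
    where m is expressed as a fraction of the repertoire size."""
    present = 1.0 - math.exp(-m)           # tag complement appears at least once
    unique = m * math.exp(-m)              # tag complement appears exactly once
    doubles = m ** 2 * math.exp(-m) / 2.0  # tag complement appears exactly twice
    return present, unique, doubles


for m in (1.000, 0.693, 0.405, 0.285, 0.223, 0.105, 0.010):
    present, unique, doubles = poisson_fractions(m)
    print(f"{m:5.3f}  {present:.2f}  {unique:.2f}  {doubles:.3f}")
```

A sample equal in size to the repertoire (m = 1) therefore covers a little under two thirds of the tag complements, which is the factor of .63 used in the panning example later in the text.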





High Specificity Sorting and Panning
The kinetics of sorting depends on the rate of hybridization of oligonucleotide
tags to their tag complements which, in turn, depends on the complexity of the tags in
the hybridization reaction. Thus, a trade-off exists between sorting rate and tag
complexity, such that an increase in sorting rate may be achieved at the cost of
reducing the complexity of the tags involved in the hybridization reaction. As
explained below, the effects of this trade-off may be ameliorated by "panning."

Specificity of the hybridizations may be increased by taking a sufficiently
small sample so that both a high percentage of tags in the sample are unique and the
nearest neighbors of substantially all the tags in a sample differ by at least two words.
This latter condition may be met by taking a sample that contains a number of
tag-polynucleotide conjugates that is about 0.1 percent or less of the size of the
repertoire being employed. For example, if tags are constructed with eight words
selected from Table II, a repertoire of 8^8, or about 1.67 x 10^7, tags and tag
complements is produced. In a library of tag-cDNA conjugates as described above, a
0.1 percent sample means that about 16,700 different tags are present. If this were
loaded directly onto a repertoire-equivalent of microparticles, or in this example a
sample of 1.67 x 10^7 microparticles, then only a sparse subset of the sampled
microparticles would be loaded. The density of loaded microparticles can be
increased, for example for more efficient sequencing, by undertaking a "panning" step
in which the sampled tag-cDNA conjugates are used to separate loaded microparticles
from unloaded microparticles. Thus, in the example above, even though a "0.1
percent" sample contains only 16,700 cDNAs, the sampling and panning steps may be
repeated until as many loaded microparticles as desired are accumulated.

A panning step may be implemented by providing a sample of tag-cDNA
conjugates each of which contains a capture moiety at an end opposite, or distal to,
the oligonucleotide tag. Preferably, the capture moiety is of a type which can be
released from the tag-cDNA conjugates, so that the tag-cDNA conjugates can be
sequenced with a single-base sequencing method. Such moieties may comprise
biotin, digoxigenin, or like ligands, a triplex binding region, or the like. Preferably,
such a capture moiety comprises a biotin component. Biotin may be attached to
tag-cDNA conjugates by a number of standard techniques. If appropriate adapters
containing PCR primer binding sites are attached to tag-cDNA conjugates, biotin may
be attached by using a biotinylated primer in an amplification after sampling.
Alternatively, if the tag-cDNA conjugates are inserts of cloning vectors, biotin may be
attached after excising the tag-cDNA conjugates by digestion with an appropriate
restriction enzyme followed by isolation and filling in a protruding strand distal to the
tags with a DNA polymerase in the presence of biotinylated uridine triphosphate.

After a tag-cDNA conjugate is captured, it may be released from the biotin
moiety in a number of ways, such as by a chemical linkage that is cleaved by
reduction, e.g. Herman et al, Anal. Biochem., 156: 48-55 (1986), or that is cleaved
photochemically, e.g. Olejnik et al, Nucleic Acids Research, 24: 361-366 (1996), or
that is cleaved enzymatically by introducing a restriction site in the PCR primer. The
latter embodiment can be exemplified by considering the library of tag-polynucleotide
conjugates described above:

5'-RCGACCA[C,W,W,W]9GG[T]19- cDNA -NNNR
       GGT[G,W,W,W]9CC[A]19- rDNA -NNNYCTAG-5'

The following adapters may be ligated to the ends of these fragments to permit 
amplification by PCR: 

5'-XXXXXXXXXXXXXXXXXXXX
   XXXXXXXXXXXXXXXXXXXXYGAT
        Right Adapter

   GATCZZACTAGTZZZZZZZZZZZZ-3'
       ZZTGATCAZZZZZZZZZZZZ
        Left Adapter

       ZZTGATCAZZZZZZZZZZZZ-5'-biotin
        Left Primer

where "ACTAGT" is a Spe I recognition site (which leaves a staggered cleavage
ready for single base sequencing), and the X's and Z's are nucleotides selected so that
the annealing and dissociation temperatures of the respective primers are
approximately the same. After ligation of the adapters and amplification by PCR
using the biotinylated primer, the tags of the conjugates are rendered single stranded
by the exonuclease activity of T4 DNA polymerase and conjugates are combined with
a sample of microparticles, e.g. a repertoire equivalent, with tag complements
attached. After annealing under stringent conditions (to minimize mis-attachment of
tags), the conjugates are preferably ligated to their tag complements and the loaded
microparticles are separated from the unloaded microparticles by capture with
avidinated magnetic beads, or like capture technique.

Returning to the example, this process results in the accumulation of about
10,500 (= 16,700 x .63) loaded microparticles with different tags, which may be
released from the magnetic beads by cleavage with Spe I. By repeating this process
40-50 times with new samples of microparticles and tag-cDNA conjugates, 4-5 x 10^5
cDNAs can be accumulated by pooling the released microparticles. The pooled
microparticles may then be simultaneously sequenced by a single-base sequencing
technique.
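
The arithmetic of the panning example can be followed step by step. The sketch below is illustrative only. It assumes that the fraction of sampled tags finding a complement follows the Poisson term 1 - e^-m of Table VII, with m = 1 for a repertoire-equivalent of microparticles. The function name and its parameters are hypothetical.

```python
import math


def panning_yield(n_words=8, words_per_tag=8, sample_fraction=0.001,
                  bead_equivalents=1.0, rounds=1):
    """Return (repertoire size, tags per sample, loaded microparticles
    accumulated over the given number of sampling-and-panning rounds)."""
    repertoire = n_words ** words_per_tag        # 8^8 possible tags
    tags_in_sample = sample_fraction * repertoire
    # Fraction of sampled tags whose complement is present on some bead.
    loaded_fraction = 1.0 - math.exp(-bead_equivalents)
    loaded_total = tags_in_sample * loaded_fraction * rounds
    return repertoire, tags_in_sample, loaded_total


repertoire, tags, loaded = panning_yield()
print(repertoire, round(tags), round(loaded))
for rounds in (40, 50):
    print(rounds, round(panning_yield(rounds=rounds)[2]))
```

Without the rounding used in the text, each round yields a little over ten thousand loaded microparticles, so forty to fifty rounds accumulate several hundred thousand, in line with the 4-5 x 10^5 figure above.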

Determining how many times to repeat the sampling and panning steps, or
more generally, determining how many cDNAs to analyze, depends on one's
objective. If the objective is to monitor the changes in abundance of relatively
common sequences, e.g. making up 5% or more of a population, then relatively small
samples, i.e. a small fraction of the total population size, may allow statistically
significant estimates of relative abundances. On the other hand, if one seeks to
monitor the abundances of rare sequences, e.g. making up 0.1% or less of a
population, then large samples are required. Generally, there is a direct relationship
between sample size and the reliability of the estimates of relative abundances based
on the sample. There is extensive guidance in the literature on determining
appropriate sample sizes for making reliable statistical estimates, e.g. Koller et al,
Nucleic Acids Research, 23:185-191 (1994); Good, Biometrika, 40: 16-264 (1953);
Bunge et al, J. Am. Stat. Assoc., 88: 364-373 (1993); and the like. Preferably, for
monitoring changes in gene expression based on the analysis of a series of cDNA
libraries containing 10^5 to 10^8 independent clones of 3.0-3.5 x 10^4 different
sequences, a sample of at least 10^4 sequences is accumulated for analysis of each
library. More preferably, a sample of at least 10^5 sequences is accumulated for the
analysis of each library; and most preferably, a sample of at least 5 x 10^5 sequences
is accumulated for the analysis of each library. Alternatively, the number of
sequences sampled is preferably sufficient to estimate the relative abundance of a
sequence present at a frequency within the range of 0.1% to 5% with a 95%
confidence limit no larger than 0.1% of the population size.
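
One way to turn the last criterion into a number is the standard normal approximation for a binomial proportion. The sketch below is an illustration under stated assumptions, not the method of the cited references. It reads "confidence limit no larger than 0.1%" as a 95% interval half-width of 0.001, and the function name is hypothetical.

```python
import math


def required_sample_size(frequency, half_width, z=1.96):
    """Approximate number of sequences needed so that the 95% confidence
    interval on an abundance estimate has the given half-width
    (normal approximation: n = z^2 * p * (1 - p) / h^2)."""
    n = z ** 2 * frequency * (1.0 - frequency) / half_width ** 2
    return math.ceil(n)


for p in (0.001, 0.01, 0.05):
    print(p, required_sample_size(p, half_width=0.001))
```

Under this reading the most demanding case is the 5% sequence, which needs a sample somewhat below 2 x 10^5. That falls between the "more preferably" and "most preferably" sample sizes given above.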

Single Base DNA Sequencing
The present invention can be employed with conventional methods of DNA
sequencing, e.g. as disclosed by Hultman et al, Nucleic Acids Research, 17: 4937-4946
(1989). However, for parallel, or simultaneous, sequencing of multiple
polynucleotides, a DNA sequencing methodology is preferred that requires neither
electrophoretic separation of closely sized DNA fragments nor analysis of cleaved
nucleotides by a separate analytical procedure, as in peptide sequencing. Preferably,
the methodology permits the stepwise identification of nucleotides, usually one at a
time, in a sequence through successive cycles of treatment and detection. Such
methodologies are referred to herein as "single base" sequencing methods. Single
base approaches are disclosed in the following references: Cheeseman, U.S. patent
5,302,509; Tsien et al, International application WO 91/06678; Rosenthal et al,
International application WO 93/21340; Canard et al, Gene, 148: 1-6 (1994); and
Metzker et al, Nucleic Acids Research, 22: 4259-4267 (1994).

A "single base" method of DNA sequencing which is suitable for use with the
present invention and which requires no electrophoretic separation of DNA fragments
is described in International application PCT/US95/03678. Briefly, the method
comprises the following steps: (a) ligating a probe to an end of the polynucleotide
having a protruding strand to form a ligated complex, the probe having a
complementary protruding strand to that of the polynucleotide and the probe having a
nuclease recognition site; (b) removing unligated probe from the ligated complex; (c)
identifying one or more nucleotides in the protruding strand of the polynucleotide by
the identity of the ligated probe; (d) cleaving the ligated complex with a nuclease; and
(e) repeating steps (a) through (d) until the nucleotide sequence of the polynucleotide,
or a portion thereof, is determined.

A single signal generating moiety, such as a single fluorescent dye, may be 
employed when sequencing several different target polynucleotides attached to 
different spatially addressable solid phase supports, such as fixed microparticles, in a 
parallel sequencing operation. This may be accomplished by providing four sets of
probes that are applied sequentially to the plurality of target polynucleotides on the
different microparticles. An exemplary set of such probes is shown below:



    Set 1                Set 2                Set 3                Set 4

 ANNNN...NN          dANNNN...NN          dANNNN...NN          dANNNN...NN
     N...NNTT...T*        N...NNTT...T         N...NNTT...T         N...NNTT...T

dCNNNN...NN           CNNNN...NN          dCNNNN...NN          dCNNNN...NN
     N...NNTT...T         N...NNTT...T*        N...NNTT...T         N...NNTT...T

dGNNNN...NN          dGNNNN...NN           GNNNN...NN          dGNNNN...NN
     N...NNTT...T         N...NNTT...T         N...NNTT...T*        N...NNTT...T

dTNNNN...NN          dTNNNN...NN          dTNNNN...NN           TNNNN...NN
     N...NNTT...T         N...NNTT...T         N...NNTT...T         N...NNTT...T*



where each of the listed probes represents a mixture of 4^3 = 64 oligonucleotides such
that the identity of the 3' terminal nucleotide of the top strand is fixed and the other
positions in the protruding strand are filled by every 3-mer permutation of nucleotides,
or complexity reducing analogs. The listed probes are also shown with a single
stranded poly-T tail with a signal generating moiety attached to the terminal thymidine,
shown as "T*". The "d" on the unlabeled probes designates a ligation-blocking moiety
or absence of 3'-hydroxyl, which prevents unlabeled probes from being ligated.
Preferably, such 3'-terminal nucleotides are dideoxynucleotides. In this embodiment,
the probes of set 1 are first applied to the plurality of target polynucleotides and treated
with a ligase so that target polynucleotides having a thymidine complementary to the 3'
terminal adenosine of the labeled probes are ligated. The unlabeled probes are
simultaneously applied to minimize inappropriate ligations. The locations of the target
polynucleotides that form ligated complexes with probes terminating in "A" are
identified by the signal generated by the label carried on the probe. After washing and
cleavage, the probes of set 2 are applied. In this case, target polynucleotides forming
ligated complexes with probes terminating in "C" are identified by location. Similarly,
the probes of sets 3 and 4 are applied and locations of positive signals identified. This
process of sequentially applying the four sets of probes continues until the desired
number of nucleotides are identified on the target polynucleotides. Clearly, one of
ordinary skill could construct similar sets of probes that could have many variations,
such as having protruding strands of different lengths, different moieties to block
ligation of unlabeled probes, different means for labeling probes, and the like.
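
The composition of one listed probe, a mixture of 4^3 = 64 protruding strands sharing a fixed terminal nucleotide, can be enumerated as follows. This is a minimal sketch. It assumes the four natural nucleotides rather than complexity reducing analogs, and it writes each protruding strand in the same left-to-right order as the probe listing, fixed terminal nucleotide first. The function name is hypothetical.

```python
from itertools import product


def probe_mixture(terminal_base, overhang_length=4, alphabet="ACGT"):
    """List every protruding-strand sequence in one probe mixture: the
    terminal nucleotide is fixed and the remaining positions take every
    possible combination of nucleotides."""
    free_positions = overhang_length - 1
    return [terminal_base + "".join(rest)
            for rest in product(alphabet, repeat=free_positions)]


labeled_set_1 = probe_mixture("A")
print(len(labeled_set_1))   # number of oligonucleotides in the mixture
print(labeled_set_1[:4])    # first few members
```

Each of the four probes in a set is such a mixture, so one set presents all 4^4 = 256 possible four-nucleotide protruding strands, of which only the 64 sharing the labeled terminal nucleotide carry the signal generating moiety.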

Apparatus for Sequencing Populations of Polynucleotides
An objective of the invention is to sort identical molecules, particularly
polynucleotides, onto the surfaces of microparticles by the specific hybridization of
tags and their complements. Once such sorting has taken place, the presence of the
molecules or operations performed on them can be detected in a number of ways
depending on the nature of the tagged molecule, whether microparticles are detected
separately or in "batches," whether repeated measurements are desired, and the like.
Typically, the sorted molecules are exposed to ligands for binding, e.g. in drug
development, or are subjected to chemical or enzymatic processes, e.g. in
polynucleotide sequencing. In both of these uses it is often desirable to simultaneously
observe signals corresponding to such events or processes on large numbers of
microparticles. Microparticles carrying sorted molecules (referred to herein as "loaded"
microparticles) lend themselves to such large scale parallel operations, e.g. as
demonstrated by Lam et al (cited above).

Preferably, whenever light-generating signals, e.g. chemiluminescent,
fluorescent, or the like, are employed to detect events or processes, loaded
microparticles are spread on a planar substrate, e.g. a glass slide, for examination with
a scanning system, such as described in International patent applications
PCT/US91/09217, PCT/NL90/00081, and PCT/US95/01886. The scanning system
should be able to reproducibly scan the substrate and to define the positions of each
microparticle in a predetermined region by way of a coordinate system. In
polynucleotide sequencing applications, it is important that the positional
identification of microparticles be repeatable in successive scan steps.

Such scanning systems may be constructed from commercially available
components, e.g. an x-y translation table controlled by a digital computer used with a
detection system comprising one or more photomultiplier tubes, or alternatively, a
CCD array, and appropriate optics, e.g. for exciting, collecting, and sorting
fluorescent signals. In some embodiments a confocal optical system may be
desirable. An exemplary scanning system suitable for use in four-color sequencing is
illustrated diagrammatically in Figure 5. Substrate 300, e.g. a microscope slide with
fixed microparticles, is placed on x-y translation table 302, which is connected to and
controlled by an appropriately programmed digital computer 304 which may be any of
a variety of commercially available personal computers, e.g. 486-based machines or
PowerPC model 7100 or 8100 available from Apple Computer (Cupertino, CA).
Computer software for table translation and data collection functions can be provided
by commercially available laboratory software, such as LabWindows, available from
National Instruments.
Substrate 300 and table 302 are operationally associated with microscope 306
having one or more objective lenses 308 which are capable of collecting and
delivering light to microparticles fixed to substrate 300. Excitation beam 310 from
light source 312, which is preferably a laser, is directed to beam splitter 314, e.g. a
dichroic mirror, which re-directs the beam through microscope 306 and objective lens
308 which, in turn, focuses the beam onto substrate 300. Lens 308 collects
fluorescence 316 emitted from the microparticles and directs it through beam splitter
314 to signal distribution optics 318 which, in turn, directs fluorescence to one or
more suitable opto-electronic devices for converting some fluorescence characteristic,
e.g. intensity, lifetime, or the like, to an electrical signal. Signal distribution optics
318 may comprise a variety of components standard in the art, such as bandpass
filters, fiber optics, rotating mirrors, fixed position mirrors and lenses, diffraction
gratings, and the like. As illustrated in Figure 2, signal distribution optics 318 directs
fluorescence 316 to four separate photomultiplier tubes, 330, 332, 334, and 336,
whose output is then directed to pre-amps and photon counters 350, 352, 354, and
356. The output of the photon counters is collected by computer 304, where it can be
stored, analyzed, and viewed on video 360. Alternatively, signal distribution optics
318 could be a diffraction grating which directs fluorescent signal 318 onto a CCD
array.

The stability and reproducibility of the positional localization in scanning will
determine, to a large extent, the resolution for separating closely spaced
microparticles. Preferably, the scanning systems should be capable of resolving
closely spaced microparticles, e.g. separated by a particle diameter or less. Thus, for
most applications, e.g. using CPG microparticles, the scanning system should at least
have the capability of resolving objects on the order of 10-100 µm. Even higher
resolution may be desirable in some embodiments, but with increased resolution, the
time required to fully scan a substrate will increase; thus, in some embodiments a
compromise may have to be made between speed and resolution. Reductions in
scanning time can be achieved by a system which only scans positions where
microparticles are known to be located, e.g. from an initial full scan. Preferably,
microparticle size and scanning system resolution are selected to permit resolution of
fluorescently labeled microparticles randomly disposed on a plane at a density
between about ten thousand and one hundred thousand microparticles per cm^2.
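
The stated density range can be related to the 10-100 µm resolution figure by estimating the typical distance between neighboring microparticles. The sketch below is an idealization, not part of the original disclosure. It treats microparticles as points scattered at random on a plane (a two-dimensional Poisson process), for which the mean nearest-neighbor distance is 1 / (2 * sqrt(density)). The function name is hypothetical.

```python
import math


def mean_nearest_neighbor_um(density_per_cm2):
    """Mean nearest-neighbor distance, in micrometres, for points placed
    at random on a plane at the given density per square centimetre."""
    spacing_cm = 1.0 / (2.0 * math.sqrt(density_per_cm2))
    return spacing_cm * 1e4  # centimetres to micrometres


for density in (1e4, 1e5):
    print(f"{density:.0e} per cm^2 -> {mean_nearest_neighbor_um(density):.1f} um")
```

Under this idealization the mean spacing is a few tens of micrometres across the whole density range, which is comparable to the microparticle diameters discussed above.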

In sequencing applications, loaded microparticles can be fixed to the surface
of a substrate in a variety of ways. The fixation should be strong enough to allow the
microparticles to undergo successive cycles of reagent exposure and washing without
significant loss. When the substrate is glass, its surface may be derivatized with an
alkylamino linker using commercially available reagents, e.g. Pierce Chemical, which
in turn may be cross-linked to avidin, again using conventional chemistries, to form
an avidinated surface. Biotin moieties can be introduced to the loaded microparticles
in a number of ways. For example, a fraction, e.g. 10-15 percent, of the cloning
vectors used to attach tags to polynucleotides are engineered to contain a unique
restriction site (providing sticky ends on digestion) immediately adjacent to the
polynucleotide insert at an end of the polynucleotide opposite of the tag. The site is
excised with the polynucleotide and tag for loading onto microparticles. After
loading, about 10-15 percent of the loaded polynucleotides will possess the unique
restriction site distal from the microparticle surface. After digestion with the
associated restriction endonuclease, an appropriate double stranded adaptor
containing a biotin moiety is ligated to the sticky end. The resulting microparticles
are then spread on the avidinated glass surface where they become fixed via the
biotin-avidin linkages.

Alternatively and preferably when sequencing by ligation is employed, in the
initial ligation step a mixture of probes is applied to the loaded microparticle: a
fraction of the probes contain a type IIs restriction recognition site, as required by the
sequencing method, and a fraction of the probes have no such recognition site, but
instead contain a biotin moiety at their non-ligating ends. Preferably, the mixture
comprises about 10-15 percent of the biotinylated probe.

In still another alternative, when DNA-loaded microparticles are applied to a
glass substrate, the DNA may nonspecifically adsorb to the glass surface upon several
hours, e.g. 24 hours, incubation to create a bond sufficiently strong to permit repeated
exposures to reagents and washes without significant loss of microparticles.
Preferably, such a glass substrate is a flow cell, which may comprise a channel etched
in a glass slide. Preferably, such a channel is closed so that fluids may be pumped
through it and has a depth sufficiently close to the diameter of the microparticles so
that a monolayer of microparticles is trapped within a defined observation region.

Identification of Novel Polynucleotides
in cDNA Libraries

Novel polynucleotides in a cDNA library can be identified by constructing a
library of cDNA molecules attached to microparticles, as described above. A large
fraction of the library, or even the entire library, can then be partially sequenced in
parallel. After isolation of mRNA, and perhaps normalization of the population as
taught by Soares et al, Proc. Natl. Acad. Sci., 91: 9228-9232 (1994), or like
references, the following primer may be hybridized to the polyA tails for first strand
synthesis with a reverse transcriptase using conventional protocols (SEQ ID NO: 1):

5'-mRNA-[A]n-3'
        [T]19-[primer site]-GG[W,W,W,C]9ACCAGCTGATC-5'

where [W,W,W,C]9 represents a tag as described above, "ACCAGCTGATC" is an
optional sequence forming a restriction site in double stranded form, and "primer site"
is a sequence common to all members of the library that is later used as a primer
binding site for amplifying polynucleotides of interest by PCR.

After reverse transcription and second strand synthesis by conventional
techniques, the double stranded fragments are inserted into a cloning vector as
described above and amplified. The amplified library is then sampled and the sample
amplified. The cloning vectors from the amplified sample are isolated, and the tagged
cDNA fragments excised and purified. After rendering the tag single stranded with a
polymerase as described above, the fragments are methylated and sorted onto
microparticles in accordance with the invention. Preferably, as described above, the
cloning vector is constructed so that the tagged cDNAs can be excised with an
endonuclease, such as Fok I, that will allow immediate sequencing by the preferred
single base method after sorting and ligation to microparticles.

Stepwise sequencing is then carried out simultaneously on the whole library,
or one or more large fractions of the library, in accordance with the invention until a
sufficient number of nucleotides are identified on each cDNA for unique
representation in the genome of the organism from which the library is derived. For
example, if the library is derived from mammalian mRNA then a randomly selected
sequence 14-15 nucleotides long is expected to have unique representation among the
2-3 thousand megabases of the typical mammalian genome. Of course identification
of far fewer nucleotides would be sufficient for unique representation in a library
derived from bacteria, or other lower organisms. Preferably, at least 20-30
nucleotides are identified to ensure unique representation and to permit construction
of a suitable primer as described below. The tabulated sequences may then be
compared to known sequences to identify unique cDNAs.

Unique cDNAs are then isolated by conventional techniques, e.g. constructing
a probe from the PCR amplicon produced with primers directed to the primer site and
the portion of the cDNA whose sequence was determined. The probe may then be
used to identify the cDNA in a library using a conventional screening protocol.

The above method for identifying new cDNAs may also be used to fingerprint
mRNA populations, either in isolated measurements or in the context of a
dynamically changing population. Partial sequence information is obtained
simultaneously from a large sample, e.g. ten to a hundred thousand, or more, of
cDNAs attached to separate microparticles as described in the above method.

Example 1

Construction of a Tag Library
An exemplary tag library is constructed as follows to form the chemically
synthesized 9-word tags of nucleotides A, G, and T defined by the formula:

3'-TGGC-[4(A,G,T)9]-CCCCp

where "[4(A,G,T)9]" indicates a tag mixture where each tag consists of nine 4-mer
words of A, G, and T; and "p" indicates a 5' phosphate. This mixture is ligated to the
following right and left primer binding regions (SEQ ID NO: 4 and SEQ ID NO: 5):

5'- AGTGGCTGGGCATCGGACCG          5'- GGGGCCCAGTCAGCGTCGAT
    TCACCGACCCGTAGCCp                     GGGTCAGTCGCAGCTA

           LEFT                              RIGHT

The right and left primer binding regions are ligated to the above tag mixture, after
which the single stranded portion of the ligated structure is filled with DNA
polymerase then mixed with the right and left primers indicated below and amplified
to give a tag library (SEQ ID NO: 6).



        Left Primer
5'- AGTGGCTGGGCATCGGACCG

5'- AGTGGCTGGGCATCGGACCG-[4(A,G,T)9]-GGGGCCCAGTCAGCGTCGAT
    TCACCGACCCGTAGCCTGGC-[4(A,G,T)9]-CCCCGGGTCAGTCGCAGCTA

                                     CCCCGGGTCAGTCGCAGCTA-5'
                                          Right Primer

The underlined portion of the left primer binding region indicates a Rsr II recognition
site. The left-most underlined region of the right primer binding region indicates
recognition sites for Bsp 120I, Apa I, and Eco O109I, and a cleavage site for Hga I.
The right-most underlined region of the right primer binding region indicates the
recognition site for Hga I. Optionally, the right or left primers may be synthesized
with a biotin attached (using conventional reagents, e.g. available from Clontech
Laboratories, Palo Alto, CA) to facilitate purification after amplification and/or
cleavage.
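
The base pairing in the primer binding regions of Example 1 can be checked mechanically. The sketch below is illustrative only. The sequences are copied from the example, the helper name is hypothetical, and the Rsr II recognition sequence CGGWCCG (W = A or T) is supplied from general knowledge rather than from the text.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Base-by-base complement, written in the same left-to-right order."""
    return seq.translate(COMPLEMENT)


left_top, left_bottom = "AGTGGCTGGGCATCGGACCG", "TCACCGACCCGTAGCC"
right_top, right_bottom = "GGGGCCCAGTCAGCGTCGAT", "GGGTCAGTCGCAGCTA"

# Each bottom strand pairs with all but four nucleotides of its top strand.
print(complement(left_top[:16]) == left_bottom)
print(complement(right_top[-16:]) == right_bottom)

# The unpaired four-nucleotide ends pair with the ends of the tag formula
# 3'-TGGC-[tag]-CCCCp.
print(complement(left_top[16:]), complement(right_top[:4]))

# Rsr II site (assumed recognition sequence CGGWCCG) in the left region.
print(bool(re.search("CGG[AT]CCG", left_top)))
```

The four-nucleotide overhangs left by the shorter bottom strands are what allow the tag mixture to be ligated between the two regions in a defined orientation.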
[Plasmid diagram omitted; surviving labels: primer binding site, Ppu MI site]




The plasmid is cleaved with Ppu MI and Pme I (to give a Rsr II-compatible end and a
flush end so that the insert is oriented) and then methylated with DAM methylase.
The tag-containing construct is cleaved with Rsr II and then ligated to the open
plasmid, after which the conjugate is cleaved with Mbo I and Bam HI to permit
ligation and closing of the plasmid. The plasmid is then amplified and isolated and
used in accordance with the invention.



Example 3

Changes in Gene Expression Profiles in Liver Tissue of Rats
Exposed to Various Xenobiotic Agents

In this experiment, to test the capability of the method of the invention to
detect genes induced as a result of exposure to xenobiotic compounds, the gene
expression profile of rat liver tissue is examined following administration of several
compounds known to induce the expression of cytochrome P-450 isoenzymes. The
results obtained from the method of the invention are compared to results obtained
from reverse transcriptase PCR measurements and immunochemical measurements of
the cytochrome P-450 isoenzymes. Protocols and materials for the latter assays are
described in Morris et al, Biochemical Pharmacology, 52: 781-792 (1996).

Male Sprague-Dawley rats between the ages of 6 and 8 weeks and weighing
200-300 g are used, and food and water are available to the animals ad lib. Test
compounds are phenobarbital (PB), metyrapone (MET), dexamethasone (DEX),
clofibrate (CLO), corn oil (CO), and β-naphthoflavone (BNF), and are available from
Sigma Chemical Co. (St. Louis, MO). Antibodies against specific P-450 enzymes are
available from the following sources: rabbit anti-rat CYP3A1 from Human Biologics,
Inc. (Phoenix, AZ); goat anti-rat CYP4A1 from Daiichi Pure Chemicals Co. (Tokyo,
Japan); monoclonal mouse anti-rat CYP1A1, monoclonal mouse anti-rat CYP2C11,
goat anti-rat CYP2E1, and monoclonal mouse anti-rat CYP2B1 from Oxford
Biochemical Research, Inc. (Oxford, MI). Secondary antibodies (goat anti-rabbit IgG,
rabbit anti-goat IgG and goat anti-mouse IgG) are available from Jackson
ImmunoResearch Laboratories (West Grove, PA).

Animals are administered either PB (100 mg/kg), BNF (100 mg/kg), MET
(100 mg/kg), DEX (100 mg/kg), or CLO (250 mg/kg) for 4 consecutive days via
intraperitoneal injection following a dosing regimen similar to that described by
Wang et al, Arch. Biochem. Biophys. 290: 355-361 (1991). Animals treated with
H2O and CO are used as controls. Two hours following the last injection (day 4),
animals are killed, and the livers are removed. Livers are immediately frozen and
stored at -70°C.

Total RNA is prepared from frozen liver tissue using a modification of the
method described by Xie et al, Biotechniques, 11: 326-327 (1991). Approximately
100-200 mg of liver tissue is homogenized in the RNA extraction buffer described by
Xie et al to isolate total RNA. The resulting RNA is reconstituted in
diethylpyrocarbonate-treated water, quantified spectrophotometrically at 260 nm, and
adjusted to a concentration of 100 µg/ml. Total RNA is stored in
diethylpyrocarbonate-treated water for up to 1 year at -70°C without any apparent
degradation. RT-PCR and sequencing are performed on samples from these
preparations.
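
The quantification and adjustment step can be illustrated with a short calculation. The following Python sketch is not part of the cited protocol; it assumes the conventional conversion of about 40 µg/ml of RNA per A260 unit at a 1 cm path length, and the absorbance reading, dilution factor, and stock volume are hypothetical values chosen only for illustration.

```python
# Conventional conversion factor (an assumption, not taken from the protocol):
# an A260 of 1.0 at a 1 cm path length corresponds to about 40 ug/ml of RNA.
RNA_UG_PER_ML_PER_A260 = 40.0


def rna_concentration(a260, dilution_factor=1.0):
    """Return the RNA concentration (ug/ml) of the undiluted stock."""
    return a260 * dilution_factor * RNA_UG_PER_ML_PER_A260


def diluent_volume(stock_conc, stock_volume_ul, target_conc=100.0):
    """Return the volume of water (ul) to add to reach target_conc (ug/ml)."""
    if stock_conc < target_conc:
        raise ValueError("stock is already more dilute than the target")
    return stock_volume_ul * (stock_conc / target_conc - 1.0)


# Hypothetical reading: A260 = 0.25 measured on a 1:50 dilution of the stock.
conc = rna_concentration(0.25, dilution_factor=50)
# Water needed to bring 20 ul of that stock to 100 ug/ml.
water = diluent_volume(conc, stock_volume_ul=20.0)
print(conc, water)
```

With these hypothetical inputs the stock works out to 500 µg/ml, so 20 µl of stock would be brought to 100 µg/ml by adding 80 µl of diethylpyrocarbonate-treated water.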

For sequencing, samples of RNA corresponding to about 0.5 ng of poly(A)+
RNA are used to construct libraries of tag-cDNA conjugates following the protocol
described in the section entitled "Attaching Tags to Polynucleotides for Sorting onto
Solid Phase Supports," with the following exception: the tag repertoire is constructed
from six 4-nucleotide words from Table II. Thus, the complexity of the repertoire is
8^6 or about 2.6 x 10^5. For each tag-cDNA conjugate library constructed, ten samples
of about ten thousand clones are taken for amplification and sorting. Each of the
amplified samples is separately applied to a fixed monolayer of about 10^6 10 µm
diameter GMA beads containing tag complements. That is, the "sample" of tag
complements in the GMA bead population on each monolayer is about four fold the
total size of the repertoire, thus ensuring there is a high probability that each of the
sampled tag-cDNA conjugates will find its tag complement on the monolayer. After
the oligonucleotide tags of the amplified samples are rendered single stranded as
described above, the tag-cDNA conjugates of the samples are separately applied to the
monolayers under conditions that permit specific hybridization only between
oligonucleotide tags and tag complements forming perfectly matched duplexes.
Concentrations of the amplified samples and hybridization times are selected to
permit the loading of about 5 x 10^4 to 2 x 10^5 tag-cDNA conjugates on each bead
where perfect matches occur. After ligation, 9-12 nucleotide portions of the attached
cDNAs are determined in parallel by the single base sequencing technique described
by Brenner in International patent application PCT/US95/03678. Frequency
distributions for the gene expression profiles are assembled from the sequence
information obtained from each of the ten samples.
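
The arithmetic behind the repertoire size and the bead coverage can be checked with a short Python sketch. It is an illustration only, under assumptions that are not stated in the text: tags are taken to be attached uniformly at random, each bead is taken to carry one tag complement drawn uniformly from the repertoire, and a Poisson approximation is used for the chance that a given complement is absent from a monolayer.

```python
import math

WORDS_PER_POSITION = 8        # word-set size implied by the stated complexity 8^6
WORDS_PER_TAG = 6             # six 4-nucleotide words per tag
BEADS_PER_MONOLAYER = 10**6   # "about 10^6" beads
CLONES_PER_SAMPLE = 10_000    # "about ten thousand clones"

# Complexity of the tag repertoire: 8^6 = 262,144, i.e. about 2.6 x 10^5.
repertoire = WORDS_PER_POSITION ** WORDS_PER_TAG

# Beads per distinct tag complement: roughly the "about four fold" in the text.
bead_coverage = BEADS_PER_MONOLAYER / repertoire

# Assumed Poisson model: probability that a given complement is on no bead.
p_complement_missing = math.exp(-bead_coverage)

# Assumed uniform tagging: probability that a sampled clone shares its tag
# with none of the other clones in the same sample.
p_tag_unique = (1.0 - 1.0 / repertoire) ** (CLONES_PER_SAMPLE - 1)

print(repertoire, round(bead_coverage, 2))
print(round(p_complement_missing, 3), round(p_tag_unique, 3))
```

Under these assumptions, a sample of ten thousand clones drawn against a repertoire of about 2.6 x 10^5 tags leaves the large majority of clones with a tag not shared by any other clone in the sample, which is the condition the sampling step is meant to secure.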

RT-PCRs of selected mRNAs corresponding to cytochrome P-450 genes and
the constitutively expressed cyclophilin gene are carried out as described in Morris et
al (cited above). Briefly, a 20 µL reaction mixture is prepared containing 1x reverse
transcriptase buffer (Gibco BRL), 10 nM dithiothreitol, 0.5 nM dNTPs, 2.5 µM oligo
d(T)15 primer, 40 units RNasin (Promega, Madison, WI), 200 units RNase H- reverse
transcriptase (Gibco BRL), and 400 ng of total RNA (in diethylpyrocarbonate-treated
water). The reaction is incubated for 1 hour at 37°C followed by inactivation of the
enzyme at 95°C for 5 min. The resulting cDNA is stored at -20°C until used. For
PCR amplification of cDNA, a 10 µL reaction mixture is prepared containing 10x
polymerase reaction buffer, 2 mM MgCl2, 1 unit Taq DNA polymerase
(Perkin-Elmer, Norwalk, CT), 20 ng cDNA, and 200 nM concentration of the 5' and 3'
specific PCR primers of the sequences described in Morris et al (cited above). PCRs
are carried out in a Perkin-Elmer 9600 thermal cycler for 23 cycles using melting,
annealing, and extension conditions of 94°C for 30 sec, 56°C for 1 min., and 72°C
for 1 min., respectively. Amplified cDNA products are separated by PAGE using 5%
native gels. Bands are detected by staining with ethidium bromide.

Western blots of the liver proteins are carried out using standard protocols
after separation by SDS-PAGE. Briefly, proteins are separated on 10% SDS-PAGE
gels under reducing conditions and immunoblotted for detection of P-450 isoenzymes
using a modification of the methods described in Harris et al, Proc. Natl. Acad. Sci.,
88: 1407-1410 (1991). Proteins are loaded at 50 µg/lane and resolved under constant
current (250 V) for approximately 4 hours at 2°C. Proteins are transferred to
nitrocellulose membranes (Bio-Rad, Hercules, CA) in 15 mM Tris buffer containing
120 mM glycine and 20% (v/v) methanol. The nitrocellulose membranes are blocked
with 2.5% BSA and immunoblotted for P-450 isoenzymes using primary monoclonal
and polyclonal antibodies and secondary alkaline phosphatase conjugated anti-IgG.
Immunoblots are developed with the Bio-Rad alkaline phosphatase substrate kit.
The three types of measurements of P-450 isoenzyme induction showed
substantial agreement.

APPENDIX Ia

Exemplary computer program for generating
minimally cross hybridizing sets
(single stranded tag/single stranded tag complement)



      Program minxh
c
c
      integer*2 sub1(6),mset1(1000,6),mset2(1000,6)
      dimension nbase(6)
c
c
      write(*,*)'ENTER SUBUNIT LENGTH'
      read(*,100)nsub
100   format(i1)
      open(1,file='sub4.dat',form='formatted',status='new')
c
c
      nset=0
      do 7000 m1=1,3
      do 7000 m2=1,3
      do 7000 m3=1,3
      do 7000 m4=1,3
      sub1(1)=m1
      sub1(2)=m2
      sub1(3)=m3
      sub1(4)=m4
c
c
      ndiff=3
c
c
c     Generate set of subunits differing from
c     sub1 by at least ndiff nucleotides.
c     Save in mset1.
c
c     jj counts the subunits stored in mset1.
c
      jj=1
      do 900 j=1,nsub
900   mset1(1,j)=sub1(j)
c
c
      do 1000 k1=1,3
      do 1000 k2=1,3
      do 1000 k3=1,3
      do 1000 k4=1,3
c
c
      nbase(1)=k1
      nbase(2)=k2
      nbase(3)=k3
      nbase(4)=k4
c
c     Count the mismatches n between sub1 and nbase.
c
      n=0
      do 1200 j=1,nsub
      if(sub1(j).eq.1 .and. nbase(j).ne.1 .or.
     1   sub1(j).eq.2 .and. nbase(j).ne.2 .or.
     3   sub1(j).eq.3 .and. nbase(j).ne.3) then
      n=n+1
      endif
1200  continue
c
c
      if(n.ge.ndiff) then
c
c     If number of mismatches
c     is greater than or equal
c     to ndiff then record
c     subunit in matrix mset
c
      jj=jj+1
      do 1100 i=1,nsub
1100  mset1(jj,i)=nbase(i)
      endif
c
c
1000  continue
c
c
      do 1325 j2=1,nsub
      mset2(1,j2)=mset1(1,j2)
1325  mset2(2,j2)=mset1(2,j2)
c
c
c     Compare subunit 2 from
c     mset1 with each successive
c     subunit in mset1, i.e. 3,
c     4,5, ... etc. Save those
c     with mismatches .ge. ndiff
c     in matrix mset2 starting at
c     position 2.
c
c     Next transfer contents
c     of mset2 into mset1 and
c     start
c     comparisons again this time
c     starting with subunit 3.
c     Continue until all subunits
c     undergo the comparisons.
c
c
      npass=0
c
c
1700  continue
      kk=npass+2
      npass=npass+1
c
c
      do 1500 m=npass+2,jj
      n=0
      do 1600 j=1,nsub
      if(mset1(npass+1,j).eq.1 .and. mset1(m,j).ne.1 .or.
     2   mset1(npass+1,j).eq.2 .and. mset1(m,j).ne.2 .or.
     2   mset1(npass+1,j).eq.3 .and. mset1(m,j).ne.3) then
      n=n+1
      endif
1600  continue
      if(n.ge.ndiff) then
      kk=kk+1
      do 1625 i=1,nsub
1625  mset2(kk,i)=mset1(m,i)
      endif
1500  continue
c
c
c     kk is the number of subunits
c     stored in mset2
c
c     Transfer contents of mset2
c     into mset1 for next pass.
c
c
      do 2000 k=1,kk
      do 2000 m=1,nsub
2000  mset1(k,m)=mset2(k,m)
      if(kk.lt.jj) then
      jj=kk
      goto 1700
      endif
c
c
      nset=nset+1
      write(1,7009)
7009  format(/)
      do 7008 k=1,kk
7008  write(1,7010)(mset1(k,m),m=1,nsub)
7010  format(4i1)
      write(*,*)
      write(*,120) kk,nset
120   format(1x,'Subunits in set=',i5,2x,'Set No=',i5)
7000  continue
      close(1)
c
c
      end
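
The selection idea in program minxh can be sketched more compactly. The Python below is a simplified illustration, not a translation of the Fortran: it keeps a word only if it differs from every word already kept in at least ndiff positions, whereas the Fortran filters its candidate list pass by pass and stops after the first pass that removes nothing. The function names are hypothetical, and nucleotides are coded 1 to 3 as in the appendix.

```python
from itertools import product


def mismatches(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_min_xh_set(seed, ndiff=3, alphabet=(1, 2, 3)):
    """Greedily build a set of words that pairwise differ in >= ndiff positions.

    Candidates are visited in the same order as the nested k1..kN loops
    (first position varying slowest).
    """
    kept = [tuple(seed)]
    for word in product(alphabet, repeat=len(seed)):
        if all(mismatches(word, k) >= ndiff for k in kept):
            kept.append(word)
    return kept


# Example: 4-nucleotide subunits over three nucleotides, starting from 1111.
words = greedy_min_xh_set((1, 1, 1, 1), ndiff=3)
for w in words:
    print("".join(str(n) for n in w))
```

Every pair of words in the returned list differs in at least three of the four positions, which is the property a minimally cross-hybridizing set of subunits is meant to have.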
APPENDIX Ib

Exemplary computer program for generating
minimally cross hybridizing sets
(single stranded tag/single stranded tag complement)



      Program tagN
c
c
c     Program tagN generates minimally cross-hybridizing
c     sets of subunits given i) N--subunit length, and ii)
c     an initial subunit sequence. tagN assumes that only
c     3 of the four natural nucleotides are used in the tags.
c
c
      character*1 sub1(20)
      integer*2 mset(10000,20), nbase(20)
c
c
      write(*,*)'ENTER SUBUNIT LENGTH'
      read(*,100)nsub
100   format(i2)
c
c
      write(*,*)'ENTER SUBUNIT SEQUENCE'
      read(*,110)(sub1(k),k=1,nsub)
110   format(20a1)
c
c
      ndiff=10
c
c
c     Let a=1 c=2 g=3 & t=4
c
c
      do 800 kk=1,nsub
      if(sub1(kk).eq.'a') then
      mset(1,kk)=1
      endif
      if(sub1(kk).eq.'c') then
      mset(1,kk)=2
      endif
      if(sub1(kk).eq.'g') then
      mset(1,kk)=3
      endif
      if(sub1(kk).eq.'t') then
      mset(1,kk)=4
      endif
800   continue
c
c
c     Generate set of subunits differing from
c     sub1 by at least ndiff nucleotides.
c
c     jj counts the subunits stored in mset.
c
      jj=1
c
c
      do 1000 k1=1,3
      do 1000 k2=1,3
      do 1000 k3=1,3
      do 1000 k4=1,3
      do 1000 k5=1,3
      do 1000 k6=1,3
      do 1000 k7=1,3
      do 1000 k8=1,3
      do 1000 k9=1,3
      do 1000 k10=1,3
      do 1000 k11=1,3
      do 1000 k12=1,3
      do 1000 k13=1,3
      do 1000 k14=1,3
      do 1000 k15=1,3
      do 1000 k16=1,3
      do 1000 k17=1,3
      do 1000 k18=1,3
      do 1000 k19=1,3
      do 1000 k20=1,3
c
c
      nbase(1)=k1
      nbase(2)=k2
      nbase(3)=k3
      nbase(4)=k4
      nbase(5)=k5
      nbase(6)=k6
      nbase(7)=k7
      nbase(8)=k8
      nbase(9)=k9
      nbase(10)=k10
      nbase(11)=k11
      nbase(12)=k12
      nbase(13)=k13
      nbase(14)=k14
      nbase(15)=k15
      nbase(16)=k16
      nbase(17)=k17
      nbase(18)=k18
      nbase(19)=k19
      nbase(20)=k20
c
c     Compare nbase with every subunit already stored in mset;
c     skip it if any stored subunit has fewer than ndiff mismatches.
c
      do 1250 nn=1,jj
c
      n=0
      do 1200 j=1,nsub
      if(mset(nn,j).eq.1 .and. nbase(j).ne.1 .or.
     1   mset(nn,j).eq.2 .and. nbase(j).ne.2 .or.
     2   mset(nn,j).eq.3 .and. nbase(j).ne.3 .or.
     3   mset(nn,j).eq.4 .and. nbase(j).ne.4) then
      n=n+1
      endif
1200  continue
c
c
      if(n.lt.ndiff) then
      goto 1000
      endif
1250  continue
c
c
      jj=jj+1
      write(*,130)(nbase(i),i=1,nsub),jj
      do 1100 i=1,nsub
      mset(jj,i)=nbase(i)
1100  continue
c
1000  continue
c
c
      write(*,*)
130   format(10x,20(1x,i1),5x,i5)
      write(*,*)
      write(*,120) jj
120   format(1x,'Number of words=',i5)
c
c
      end

APPENDIX Ic

Exemplary computer program for generating
minimally cross hybridizing sets
(double stranded tag/single stranded tag complement)

      Program 3tagN
c
c
c     Program 3tagN generates minimally cross-hybridizing
c     sets of duplex subunits given i) N--subunit length,
c     and ii) an initial homopurine sequence.
c
      character*1 sub1(20)
      integer*2 mset(10000,20), nbase(20)
c
c
      write(*,*)'ENTER SUBUNIT LENGTH'
      read(*,100)nsub
100   format(i2)
c
c
      write(*,*)'ENTER SUBUNIT SEQUENCE a & g only'
      read(*,110)(sub1(k),k=1,nsub)
110   format(20a1)
c
      ndiff=10
c
c     Let a=1 and g=2
c
      do 800 kk=1,nsub
      if(sub1(kk).eq.'a') then
      mset(1,kk)=1
      endif
      if(sub1(kk).eq.'g') then
      mset(1,kk)=2
      endif
800   continue
c
c     jj counts the subunits stored in mset.
c
      jj=1
c
      do 1000 k1=1,3
      do 1000 k2=1,3
      do 1000 k3=1,3
      do 1000 k4=1,3
      do 1000 k5=1,3
      do 1000 k6=1,3
      do 1000 k7=1,3
      do 1000 k8=1,3
      do 1000 k9=1,3
      do 1000 k10=1,3
      do 1000 k11=1,3
      do 1000 k12=1,3
      do 1000 k13=1,3
      do 1000 k14=1,3
      do 1000 k15=1,3
      do 1000 k16=1,3
      do 1000 k17=1,3
      do 1000 k18=1,3
      do 1000 k19=1,3
      do 1000 k20=1,3
c
c
      nbase(1)=k1
      nbase(2)=k2
      nbase(3)=k3
      nbase(4)=k4
      nbase(5)=k5
      nbase(6)=k6
      nbase(7)=k7
      nbase(8)=k8
      nbase(9)=k9
      nbase(10)=k10
      nbase(11)=k11
      nbase(12)=k12
      nbase(13)=k13
      nbase(14)=k14
      nbase(15)=k15
      nbase(16)=k16
      nbase(17)=k17
      nbase(18)=k18
      nbase(19)=k19
      nbase(20)=k20
c
c     Compare nbase with every subunit already stored in mset;
c     skip it if any stored subunit has fewer than ndiff mismatches.
c
      do 1250 nn=1,jj
c
      n=0
      do 1200 j=1,nsub
      if(mset(nn,j).eq.1 .and. nbase(j).ne.1 .or.
     1   mset(nn,j).eq.2 .and. nbase(j).ne.2 .or.
     2   mset(nn,j).eq.3 .and. nbase(j).ne.3 .or.
     3   mset(nn,j).eq.4 .and. nbase(j).ne.4) then
      n=n+1
      endif
1200  continue
c
      if(n.lt.ndiff) then
      goto 1000
      endif
1250  continue
c
      jj=jj+1
      write(*,130)(nbase(i),i=1,nsub),jj
      do 1100 i=1,nsub
      mset(jj,i)=nbase(i)
1100  continue
c
1000  continue
c
      write(*,*)
130   format(10x,20(1x,i1),5x,i5)
      write(*,*)
      write(*,120) jj
120   format(1x,'Number of words=',i5)
c
c
      end

SEQUENCE LISTING



(1) GENERAL INFORMATION: 



(i) APPLICANT: David W. Martin, Jr. 



(ii) TITLE OF INVENTION: Measurement of Gene Expression profiles in 
Toxicity Determination 



(iii) NUMBER OF SEQUENCES: 7 



(iv) CORRESPONDENCE ADDRESS:

(A) ADDRESSEE: Stephen C. Macevicz, Lynx Therapeutics, Inc. 

(B) STREET: 3832 Bay Center Place 

(C) CITY: Hayward 

(D) STATE: California 

(E) COUNTRY: USA 

(F) ZIP: 94545 



(v) COMPUTER READABLE FORM: 

(A) MEDIUM TYPE: 3.5 inch diskette 

(B) COMPUTER: IBM compatible 

(C) OPERATING SYSTEM: Windows 3.1 

(D) SOFTWARE: Microsoft Word 5.1 



(vi) CURRENT APPLICATION DATA: 

(A) APPLICATION NUMBER: 

(B) FILING DATE: 

(C) CLASSIFICATION: 

(vii) PRIOR APPLICATION DATA:

(A) APPLICATION NUMBER: PCT/US96/09513

(B) FILING DATE: 06-JUN-96



(vii) PRIOR APPLICATION DATA:

(A) APPLICATION NUMBER: PCT/US95/12791

(B) FILING DATE: 12-OCT-95

(viii) ATTORNEY/AGENT INFORMATION:

(A) NAME: Stephen C. Macevicz

(B) REGISTRATION NUMBER: 30,285

(C) REFERENCE/DOCKET NUMBER: 813wo



(ix) TELECOMMUNICATION INFORMATION: 

(A) TELEPHONE: (510) 670-9365 

(B) TELEFAX: (510) 670-9302 



(2) INFORMATION FOR SEQ ID NO: 1: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 1:



CTAGTCGACC A 



(2) INFORMATION FOR SEQ ID NO: 2: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 11 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 2:



NRRGATCYNN N 



(2) INFORMATION FOR SEQ ID NO: 3: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 38 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: single 

(D) TOPOLOGY: linear 



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 3: 



GAGGATGCCT TTATGGATCC ACTCGAGATC CCAATCCA 



(2) INFORMATION FOR SEQ ID NO: 4: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 nucleotides 

(B) TYPE: nucleic acid

(C) STRANDEDNESS: double

(D) TOPOLOGY: linear



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 4:



AGTGGCTGGG CATCGGACCG 



(2) INFORMATION FOR SEQ ID NO: 5: 



(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 20 nucleotides 

(B) TYPE: nucleic acid

(C) STRANDEDNESS: double

(D) TOPOLOGY: linear



(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 5: 



GGGGCCCAGT CAGCGTCGAT 20



(2) INFORMATION FOR SEQ ID NO: 6: 

(i) SEQUENCE CHARACTERISTICS:

(A) LENGTH: 20 nucleotides

(B) TYPE: nucleic acid

(C) STRANDEDNESS: single

(D) TOPOLOGY: linear

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 6: 

ATCGACGCTG ACTGGGCCCC

(2) INFORMATION FOR SEQ ID NO: 7: 

(i) SEQUENCE CHARACTERISTICS: 

(A) LENGTH: 62 nucleotides 

(B) TYPE: nucleic acid 

(C) STRANDEDNESS: double 

(D) TOPOLOGY: linear 

(xi) SEQUENCE DESCRIPTION: SEQ ID NO: 7: 

AAAAGGAGGA GGCCTTGATA GAGAGGACCT GTTTAAACGG ATCCTCTTCC 50 
TCTTCCTCTT CC

I claim:

1. A method of determining the toxicity of a compound, the method comprising
the steps of:

administering the compound to a test organism;

extracting a population of mRNA molecules from each of one or more tissues
of the test organism;

forming a separate population of cDNA molecules from each population of
mRNA molecules from the one or more tissues such that each cDNA molecule of a
separate population has an oligonucleotide tag attached, the oligonucleotide tags
being selected from the same minimally cross-hybridizing set;

separately sampling each population of cDNA molecules such that
substantially all different cDNA molecules within a separate population have different
oligonucleotide tags attached;

sorting the cDNA molecules of each separate population by specifically
hybridizing the oligonucleotide tags with their respective complements, the respective
complements being attached as uniform populations of substantially identical
complements in spatially discrete regions on one or more solid phase supports;

determining the nucleotide sequence of a portion of each of the sorted cDNA
molecules of each separate population to form a frequency distribution of expressed
genes for each of the one or more tissues; and

correlating the frequency distribution of expressed genes in each of the one or
more tissues with the toxicity of the compound.

2. The method of claim 1 wherein said oligonucleotide tag and said complement
of said oligonucleotide tag are single stranded.

3. The method of claim 2 wherein said oligonucleotide tag consists of a plurality
of subunits, each subunit consisting of an oligonucleotide of 3 to 9 nucleotides in
length and each subunit being selected from the same minimally cross-hybridizing set.

4. The method of claim 3 wherein said one or more solid phase supports are
microparticles and wherein said step of sorting said cDNA molecules onto the
microparticles produces a subpopulation of loaded microparticles and a subpopulation
of unloaded microparticles.

5. The method of claim 4 further including a step of separating said loaded
microparticles from said unloaded microparticles.

6. The method of claim 5 further including a step of repeating said steps of
sampling, sorting, and separating until a number of said loaded microparticles is
accumulated is at least 10,000.

7. The method of claim 6 wherein said number of loaded microparticles is at
least 100,000.

8. The method of claim 7 wherein said number of loaded microparticles is at
least 500,000.

9. The method of claim 5 further including a step of repeating said steps of
sampling, sorting, and separating until a number of said loaded microparticles is
accumulated is sufficient to estimate the relative abundance of a cDNA molecule
present in said population at a frequency within the range of from 0.1% to 5% with a
95% confidence limit no larger than 0.1% of said population.

10. The method of claim 4 wherein said test organism is a mammalian tissue
culture.

11. The method of claim 10 wherein said mammalian tissue culture comprises
hepatocytes.

12. The method of claim 4 wherein said test organism is an animal selected from
the group consisting of rats, mice, hamsters, guinea pigs, rabbits, cats, dogs, pigs, and
monkeys.

13. The method of claim 12 wherein said one or more tissues are selected from the
group consisting of liver, kidney, brain, cardiovascular, thyroid, spleen, adrenal, large
intestine, small intestine, pancreas, urinary bladder, stomach, ovary, testes, and
mesenteric lymph nodes.

14. A method of identifying genes which are differentially expressed in a selected
tissue of a test animal after treatment with a compound, the method comprising the
steps of:

administering the compound to a test animal;

extracting a population of mRNA molecules from the selected tissue of the
test animal;

forming a population of cDNA molecules from the population of mRNA
molecules such that each cDNA molecule has an oligonucleotide tag attached, the
oligonucleotide tags being selected from the same minimally cross-hybridizing set;

sampling the population of cDNA molecules such that substantially all
different cDNA molecules have different oligonucleotide tags attached;

sorting the cDNA molecules by specifically hybridizing the oligonucleotide
tags with their respective complements, the respective complements being attached as
uniform populations of substantially identical complements in spatially discrete
regions on one or more solid phase supports;

determining the nucleotide sequence of a portion of each of the sorted cDNA
molecules to form a frequency distribution of expressed genes; and

identifying genes expressed in response to administering the compound by
comparing the frequency distribution of expressed genes of the selected tissue of the
test animal with a frequency distribution of expressed genes of the selected tissue of a
control animal.

15. The method of claim 14 wherein said oligonucleotide tag and said
complement of said oligonucleotide tag are single stranded.

16. The method of claim 15 wherein said oligonucleotide tag consists of a
plurality of subunits, each subunit consisting of an oligonucleotide of 3 to 9
nucleotides in length and each subunit being selected from the same minimally
cross-hybridizing set.

17. The method of claim 16 wherein said one or more solid phase supports are
microparticles and wherein said step of sorting said cDNA molecules onto the
microparticles produces a subpopulation of loaded microparticles and a subpopulation
of unloaded microparticles.

18. The method of claim 17 further including a step of separating said loaded
microparticles from said unloaded microparticles.

19. The method of claim 18 further including a step of repeating said steps of
sampling, sorting, and separating until a number of said loaded microparticles is
accumulated is at least 10,000.

20. The method of claim 19 wherein said number of loaded microparticles is at
least 100,000.

21. The method of claim 20 wherein said number of loaded microparticles is at
least 500,000.

22. The method of claim 18 further including a step of repeating said steps of
sampling, sorting, and separating until a number of said loaded microparticles is
accumulated is sufficient to estimate the relative abundance of a cDNA molecule
present in said population at a frequency within the range of from 0.1% to 5% with a
95% confidence limit no larger than 0.1% of said population.

23. The method of claim 17 wherein said test animal is selected from the group
consisting of rats, mice, hamsters, guinea pigs, rabbits, cats, dogs, pigs, and monkeys.

24. The method of claim 23 wherein said selected tissue is selected from the
group consisting of liver, kidney, brain, cardiovascular, thyroid, spleen, adrenal, large
intestine, small intestine, pancreas, urinary bladder, stomach, ovary, testes, and
mesenteric lymph nodes.

25. A use of the technique of massively parallel signature sequencing to determine
the toxicity of a compound in a test organism, the use comprising the steps of:

administering the compound to a test organism;

extracting a population of mRNA molecules from each of one or more tissues
of the test organism and forming a population of cDNA molecules for each of the one
or more tissues;

determining the nucleotide sequence of a portion of each of the cDNA
molecules of each separate population using massively parallel signature sequencing
to form a frequency distribution of expressed genes for each of the one or more
tissues; and

correlating the frequency distribution of expressed genes in each of the one or
more tissues with the toxicity of the compound.

26. The use of claim 25 wherein said test organism is a mammalian tissue culture.

27. The use of claim 26 wherein said mammalian tissue culture comprises
hepatocytes.

28. The use of claim 25 wherein said test organism is an animal selected from the
group consisting of rats, mice, hamsters, guinea pigs, rabbits, cats, dogs, pigs, and
monkeys.

29. The use of claim 28 wherein said one or more tissues are selected from the
group consisting of liver, kidney, brain, cardiovascular, thyroid, spleen, adrenal, large
intestine, small intestine, pancreas, urinary bladder, stomach, ovary, testes, and
mesenteric lymph nodes.

30. A use of the technique of massively parallel signature sequencing to identify
genes which are differentially expressed in a test organism after treatment with a
compound and which are correlated with toxicity of the compound, the use
comprising the steps of:

administering the compound to the test organism;

extracting a population of mRNA molecules from a selected tissue of the test
organism and forming a population of cDNA molecules;

determining the nucleotide sequence of a portion of each of the cDNA
molecules using massively parallel signature sequencing to form a frequency
distribution of expressed genes;

identifying genes expressed in response to administering the compound by
comparing the frequency distribution of expressed genes of the selected tissue of the
test organism with a frequency distribution of expressed genes of the selected tissue
of a control organism; and

determining whether the genes expressed in response to administering the
compound are correlated with toxicity of the compound in the test organism.
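
The accumulation criterion recited in claims 9 and 22 can be illustrated numerically. The Python sketch below is one possible reading, not a construction of the claims: it assumes that each loaded microparticle is an independent draw from the cDNA population, that the normal approximation to the binomial applies, and that the "95% confidence limit" is the half-width of a two-sided 95% interval expressed as a fraction of the population. The function name is hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def beads_needed(freq, half_width=0.001, z=Z_95):
    """Loaded microparticles needed so that the 95% interval for a cDNA
    present at relative abundance `freq` has half-width <= `half_width`
    (normal approximation to the binomial; an assumed reading of the claim)."""
    return math.ceil(z * z * freq * (1.0 - freq) / half_width ** 2)


# Abundances spanning the recited range of 0.1% to 5%.
for f in (0.001, 0.01, 0.05):
    print(f, beads_needed(f))

# The most demanding case in the range is the most abundant cDNA (5%).
worst_case = max(beads_needed(f) for f in (0.001, 0.01, 0.05))
```

On this reading, the upper end of the frequency range drives the requirement, and the resulting count falls between the 100,000 and 500,000 thresholds recited in the dependent claims.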
August 11, 1997, Monday

SECTION: Financial News 

DISTRIBUTION: TO BUSINESS AND MEDICAL EDITORS 
LENGTH: 478 words

HEADLINE: Eli Lilly & Co. and Acacia Biosciences Enter Into Research Collaboration; 
First Corporate Agreement for Acacia's Genome Reporter Matrix(TM) 

DATELINE: RICHMOND, Calif., Aug. 11 

BODY: 

Acacia Biosciences and Eli Lilly and Company (Lilly) announced today the signing of a joint research collaboration 
to utilize Acacia's Genome Reporter Matrix(TM) (GRM) to aid in the selection and optimization of lead compounds. 
Under the collaboration, Acacia will provide chemical and biological profiles on a class of Lilly's compounds for an 
undisclosed fee. 

Acacia's GRM is an assay-based computer modeling system that uses yeast as a miniature ecosystem. The GRM 
can profile the extent, nature and quantity of any changes in gene expression. Because of the similarities between 
the yeast and human genome, the system serves as an excellent surrogate for the human body, mimicking the effects 
induced by a biologically active molecule. 

"Using yeast as a model organism for lead optimization makes a lot of sense given the high degree of homology with
human metabolic pathways," said William Current of Lilly Research Laboratories. "Acacia's innovative GRM has
the potential to provide enormous insight into the therapeutic impact of our compounds and make the drug discovery
process more rational. It should substantially accelerate the development process."

"This first agreement with a major pharmaceutical company is an important milestone in the development of 
Acacia," said Bruce Cohen, President and CEO of Acacia. "The deal is in line with our strategy of establishing 
alliances that will allow our collaborators to use genomic profiles to identify and optimize compounds within 
their existing portfolios. In the long run, this technology can be used to characterize large scale combinatorial 
libraries, predict side effects prior to clinical trials and resurrect drugs that have failed during clinical trials." 

The GRM incorporates two critical elements: chemical response profiles and genetic response profiles. The 
chemical response profiles measure the change in gene expression caused by potential therapeutics and then rank genes 
with altered expressions by degree of response. The genetic response profiles measure changes in gene expression 
caused by mutations in the genes encoding potential targets of pharmaceuticals; these genetic response profiles represent 
gold standards in drug discovery by defining the response profile expected for drugs with perfect selectivity and 
specificity. By comparing the two profiles, one can analyze a potential drug candidate's ability to mimic the action of 
a 'perfect' drug. 

Acacia Biosciences is a functional genomics company developing proprietary technologies to enhance the speed 
and efficacy of drug discovery and development. Acacia's Genome Reporter Matrix capitalizes on the latest advances 
in genomics and combinatorial chemistry to generate comprehensive profiles of drug candidates' in vivo activity. 
SOURCE Acacia Biosciences 

CONTACT: Bruce Cohen, President and CEO of Acacia Biosciences, 510-669-2330 ext. 103 or Media: Linda 
Seaton of Feinstein

Pharmagene Raises More Capital for Research on Human Tissues

By Sophia Fox

Pharmagene, the Royston, U.K.-based biopharmaceutical company specialising in the use of
human biomaterials for drug discovery research, has raised a further £5 million from a group
of investors led by 3i and Abacus Nominees. The funding will enable the company to expand both
its human biomaterials collection and its capabilities across a range of proprietary platform
technologies.

Gordon Baxter, Ph.D., Pharmagene's cofounder and chief operating officer, claimed "by the end
of this year Pharmagene will have access to the largest collection of human RNAs and proteins
anywhere in the world, and a range of innovative, yet robust technologies
SEE PHARMAGENE, P. 9



Perkin-Elmer Acquires PerSeptive to Expand 
Its Capabilities in Cene-BasedDrug Discovery 



By John Sterling 

Perkin-Elmer's (PE; Norwalk, CT) decision last month to acquire PerSeptive Biosystems (Framingham, MA) via a $360 million stock swap was designed to strengthen PE in terms of broad capabilities in gene-based drug discovery. The company's main goal is to develop new products to improve the integration of genetic and protein research.

"This merger will enhance our position as an effective provider of innovative, integrated platforms enabling our customers to be more efficient and cost-effective in bringing new pharmaceuticals to market," says Tony L. White, PE's chairman, president and CEO. "The combination of our two companies should bolster our presence in the life sciences, [and it is our] belief that we must take bold action now to lead the emerging era of molecular medicine with leading positions in both genetic and protein analysis."

A driving force behind the 
merger is the vast amount of genet- 



FDA OKs Genzyme's Carticel Product for Damage to Knees

[Photo: "Cell Processing" (credit: Genzyme Tissue Repair)]

Carticel, which was approved for the repair of clinically significant, symptomatic cartilaginous defects of the femoral condyle (medial, lateral or trochlear) caused by acute or repetitive trauma, employs a proprietary process to grow autologous cartilage cells for implantation.



By Naomi Pfeiffer

The FDA has approved a knee-cartilage replacement product made by Genzyme Tissue Repair (Cambridge, MA), a tracking-stock division of Genzyme Corp., for people with trauma-damaged knees.

Carticel (autologous cultured
chondrocytes) is the first product to
be licensed under the FDA's pro-
SEE GENZYME, P. 6

Perkin-Elmer acquired PerSeptive Biosystems for $360 million to obtain new technologies in mass spectrometry, bioseparations and purification for product development projects, spanning the range from genomics to proteomics.



ic information about human dis- 
ease that is being accumulated by 
researchers and biotech companies 
working in the area of genomics. It 
is becoming increasingly obvious 
that these data need to be comple- 
mented with technologies for 



studying proteins and protein networks, a field known as proteomics (see GEN, September 1, 1997, p. 1).

PE officials, who claim that 
MALDI-TOF (Matrix Assisted 
SEE ACQUISITION, P. 10



Strategies for Target Validation 
Streamline Evaluation of Leads 



By Vicki Glaser

Acacia Biosciences (Richmond, CA) last month announced its first agreement with a major pharmaceutical company, signing a deal with Eli Lilly (Indianapolis, IN) to use Acacia's Genome Reporter Matrix (GRM) to select and optimize some of Lilly's lead compounds. Acacia's yeast-based system for profiling drug activity is useful for evaluating the therapeutic potential of lead compounds, and it also has a role in the identification and validation of new drug targets.

"We're using the ecosystem of a cell to allow us to deduce the mechanism of action and target for any chemical," explains Bruce Cohen, president and CEO. "We screen for every target in a cell simultaneously ... using transcription as a readout for how a cell is adapting to any perturbation," he says.

The GRM technology consists of 
two main databases: one is the 
genetic response profile, showing 
the effects of mutations in each 
individual yeast gene and compen- 
satory gene regulatory mecha- 
nisms; the other is the chemical 
response profile, which documents 
changes in gene expression in 
response to chemical compounds. 
Computational analysis and pattern 
matching between the genetic and 
chemical profiles yields informa- 
tion on the specificity, potency and 
side-effects risk of a drug lead. 
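The pattern matching described above can be pictured with a small sketch. The article does not say which similarity measure Acacia uses, so the Pearson correlation below is an assumption chosen only for illustration, and the gene names and profile values are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no pattern to match
    return cov / (sd_x * sd_y)

def best_matching_target(chemical_profile, genetic_profiles):
    """Score a chemical response profile against every genetic response profile."""
    scores = {gene: pearson(chemical_profile, profile)
              for gene, profile in genetic_profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical genetic response profiles (reporter changes per mutated gene).
genetic_profiles = {
    "geneA": [2.0, -1.0, 0.5, 0.0],
    "geneB": [-1.5, 1.0, 0.0, 2.0],
}
# Hypothetical chemical response profile for one compound.
compound = [1.8, -0.9, 0.4, 0.1]

best, scores = best_matching_target(compound, genetic_profiles)
print(best, scores)
```

In this toy case the compound's profile tracks the profile of the geneA mutant, which under the article's logic would point to the geneA product as a candidate target.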

Targeting Targets 

No longer is mapping and 
sequencing a gene — or the human 
genome — an end unto itself, but 
SEE TARGET, P. 15 



Sticky Ends 

Avigen received two grants from the NIH & University of California for research on gene therapy for treatment of cancer & HIV infections ... MRL Pharmaceutical Services, of Reston, VA, launched the TSN Bug Finder, which is able to locate & retrieve client-specified microorganisms in real time ... Gensia Sicor, Inc. will move its corporate staff from San Diego to Irvine, CA, by end of year ... FDA accepted NDA from Sepracor for levalbuterol HCl inhalation solution ... An $11.7M mezzanine financing has been closed by Activated Cell Therapy, which changed its name to Dendreon Corporation ... Astra AB will build major research facility in Waltham, MA, and is also relocating Astra Arcus research facility from Rochester to Boston area ... Prolifix Ltd. team used a small peptide to inhibit the E2F protein complex and induced apoptosis in mammalian tumor cells ... Vertex Pharmaceuticals, Inc. and Alpha Therapeutic Corp. ended an agreement to develop VX-366 for treatment of inherited hemoglobin disorders ... NaviCyte received Phase I SBIR grant for up to $100,000 from NIH for development of prototype of its NaviFlow technology for high-throughput screening
... Covance Inc. will invest $21 million in expansion and renovation of its facility in Indianapolis, IN.

merely a means to an end. The critical next step is to validate the gene and its protein product as a potential drug target. The Human Genome Project continues to produce a treasure chest of expressed sequence tags (ESTs) and a tantalizing array of complete gene sequences.

Companies are applying a variety 
of functional genomic strategies to 
link genes to specific diseases and to 
multigenic phenotypes. Yet the ulti- 
mate challenge for pharmaceutical 
companies is to sift through all the 
sequence and differential gene 
expression data to identify the best 
targets for drug discovery. 

Spinning off technology developed at the University of North Carolina (Chapel Hill), Cytogen Corp. (Princeton, NJ) formed its wholly owned subsidiary AxCell Biosciences earlier this year. The young company is building a protein interaction database, cataloging all the interactions the modular domains of proteins can engage in with a range of ligands, in order to gain insight into protein function and to select the most critical interaction to target for drug development.

AxCell's cloning-of-ligand-targets (COLT) technology employs "recognition units" from the company's genetic diversity library (GDL) to map functional protein interactions and quantitate their affinity. The company's inter-functional proteomic database (IFP-dbase) elucidates protein interaction networks and structure-activity relationships based on ligand affinity with protein modular domains.

Defining Disease Pathways

Signal Pharmaceuticals, Inc.'s (San Diego, CA) integrated drug target and discovery effort is based on mapping gene-regulating pathways in cells and identifying small molecules that regulate the activation of those genes. In collaboration with academic researchers, the company has identified a large number of regulatory proteins in several mitogen-activated protein (MAP) kinase pathways (including the JNK, FRK and p38 signaling pathways), which Signal is evaluating for the treatment of autoimmune, inflammatory, cardiovascular and neurologic diseases, and cancer. Other target identification programs focus on the NF-kB pathway, estrogen-related genes and central/peripheral nervous system genes.

[Figure caption, partly illegible in the scan: The Genome Reporter Matrix depicts a subset of an array. Each colony ... reports the expression of ... yeast genes. Credit: Acacia]

Regulating cytokine production in immune and inflammatory disorders and modifying bone metabolism to treat osteoporosis are the focus of Signal's collaboration with Tanabe Seiyaku (Osaka, Japan). Signal has partnered with Organon/Akzo Nobel (Netherlands) to identify estrogen-responsive genes as targets for treating neurodegenerative and psychiatric diseases, atherosclerosis and ischemia, and with Roche Bioscience (Palo Alto, CA) to develop human peripheral nerve cell lines for the discovery of treatments for pain and incontinence.

Exelixis' (S. San Francisco, CA)
strategy for target selection is to 
define disease pathways and identify 
regulatory molecules that activate or 
inhibit those biochemical/genetic 
pathways. Based on the finding that 
these pathways are conserved across 
species, the company is studying the 
model genetic systems of Drosophila 
and Caenorhabditis elegans. Using 
its PathFinder technology, Exelixis 
systematically introduces mutations 
into the genomes of these model 
organisms, looking for mutations 
that enhance or suppress the target 
disease-related gene. These novel 
genes then become the basis of drug 
screening assays. 

Cadus Pharmaceutical Corp. (Tarrytown, NY) is identifying surrogate ligands to newly discovered orphan G-protein coupled transmembrane receptors of unknown function to determine the suitability of the receptors as drug targets. Inserting the novel receptor in a yeast system yields a ligand that activates the receptor. Access to a surrogate ligand allows the company to screen for receptor antagonists in the yeast system.

"The antagonist plus the surrogate ligand gives you two probes, an on probe and an off probe, which allows you to look at function," explains David Webb, Ph.D., vp of research and chief scientific officer. A surrogate ligand also provides information on which G-protein interacts with the orphan receptor and its associated signaling pathways, further clarifying the role of the receptor as a potential drug target. Cadus' collaboration with SmithKline (Philadelphia) capitalizes on Cadus' ability to determine orphan receptor function, applying the technology to SmithKline's proprietary, newly discovered G-protein receptors.

Cadus' recombinant yeast system can also be used to screen cell and tissue extracts for natural ligands, and the company is accelerating its internal drug-discovery efforts in the areas of cancer, inflammation and allergy. A recent equity investment in Axiom Biotechnologies (San Diego, CA) gave Cadus a license to Axiom's high-throughput pharmacologic screening system for lead optimization and discovery.

As its name implies, 
gene/Networks (Alameda, CA) 
focuses on identifying gene networks 
that contribute to multigenic pheno- 
types and complex disease process- 
es. The integration of mouse and 
human genetic studies forms the 
basis of the technology. The Genome 
Tagged Mice database in develop- 
ment will serve as a library of natur- 
al mouse genetic and phenotypic 
variation. Disease-related genes 
identified in mice are then evaluated 
in human family- and population- 
based studies to confirm their clini- 
cal relevance and linkages to patho- 
physiologic traits. 

Blocking Gene Expression 

Inactivating a gene known to be expressed in association with a particular disease is one approach to identifying appropriate therapeutic targets. The target validation and discovery program at Ribozyme Pharmaceuticals, Inc. (Boulder, CO) applies the company's ribozyme technology to achieve selective inhibition of gene expression in cell culture and in animals.

Correlation of the gene expression inhibition with phenotype can suggest the relative importance of the gene in disease pathology. The company's nuclease-resistant ribozymes form the basis of a collaboration with Schering AG (Germany) for drug target validation and the development of ribozyme-based therapeutic agents, and with Chiron Corp. (Emeryville, CA) for target validation.

With several antisense compounds now progressing through clinical trials, the concept of using oligonucleotides to inhibit gene activity is not new. But rather than focusing on therapeutics development, Sequitur, Inc. (Natick, MA) is creating antisense compounds for the purpose of determining gene function and validating drug targets. Clients typically provide the one-year-old company with the sequence (or EST) of a potential gene target and, in return, Sequitur custom designs a series of three to six antisense compounds that yield a three-to-ten-fold inhibition of the target gene in cell culture. The company also provides oligofectins, a series of cationic lipids, to deliver the oligonucleotides to a variety of cultured cells.

"Differential expression information is just for correlation, it doesn't tell function or confirm what would be a good target," says Tod Woolf, Ph.D., director of technology development at Sequitur. Antisense compounds, by contrast, will inhibit a target. Sequitur offers both phosphorothioate DNA antisense compounds and its proprietary Next Generation chimeric oligonucleotides, which have a higher hybridization affinity, greater specificity and reduced toxicity, according to the company.

Mining Pathogen Genomes 

Companies such as Human Genome Sciences (HGS; Rockville, MD), Incyte (Palo Alto, CA), Millennium Pharmaceuticals Inc. (Cambridge, MA) and Genome Therapeutics (Waltham, MA) are relying on high-speed DNA sequencing, positional cloning and other strategies to identify specific microbial genomic sites that would be good targets for infectious disease therapeutics.

AxCell Biosciences scientists say their technology enables the rapid and simple functional identification of the two essential molecular components of protein interaction networks: specific recognition units that bind distinct modular protein domains are identified and isolated using a combination structural/functional approach that uses both peptide phage display Genetic Diversity Libraries (GDL) and bioinformatics, and Cloning of Ligand Targets (COLT) technology utilizes recognition units as functional probes to isolate families of interactor proteins.

HGS recently completed sequenc- 
ing of the bacterial pathogen 
Streptococcus pneumoniae, which is 
the focus of an agreement with 
Hoffmann-La Roche (Basel, 
Switzerland). Roche will use the 
sequence data to develop new anti- 
infectives against S. pneumoniae. 
HGS and Roche have expanded their 
collaboration to include a nonexclu- 
sive license to access sequence infor- 
mation for the intestinal bacterium 
Enterococcus faecalis. 

Incyte Pharmaceuticals has completed one-fold coverage of the Candida albicans genome, identifying 60% of the genes of this fungal pathogen. This genome will become part of the company's PathoSeq microbial database. Incyte recently introduced the ZooSeq animal gene sequence and expression database. The database will provide genomic information across various species commonly used in preclinical drug testing, which may help to better define potential drug targets.

Millennium Pharmaceuticals continues to report success in identifying novel drug targets, having recently discovered a novel chemokine called neurotactin and a new class of MAD-related proteins that inhibit transforming growth factor beta (TGF-β) signaling. The company also received U.S. patent coverage for the tub genes, believed to play a role in obesity, and for the gene that encodes the protein melastatin, which appears to suppress metastasis in malignant
melanoma.

Pangea

from page 28

Smith, now a computer programmer, is an expert in systems integration, Internet technologies and the application of industrial engineering principles to the drug discovery process. Before co-founding Pangea, he was the manager of software development at Attorney's Briefcase, a legal research software company.

By being "in the trenches" with customers and collaborators, Bellenson and Smith sensed the frustration of pharmaceutical researchers whose incompatible tools have impeded their progress. According to Bellenson, "Most of them are geared toward analyzing one molecule at a time. It's like emptying the ocean with an eye dropper, an incompatible eye dropper at that. A pharmaceutical company may have 30 different drug discovery teams with various approaches. The problem is to manage the process of experimenting with a lot of different approaches, to automate while maintaining flexibility."

Gene World 2.1 enables "integra-
tion of the entire target discovery and
validation process," Bellenson says.
The commercial software package 
coordinates the entire process of 
sequence-data analysis and can be 
integrated with other programs and 
databases, according to Smith, who 
adds that it handles thousands of 
sequence results, organizes and auto- 
mates annotation and seamlessly 
interacts with growing genome data- 
bases. Simple forms and menus 
enable users to turn raw sequence 
data into crucial knowledge for drug 
discovery by applying algorithms to 
sequences, creating custom analysis 
strategies and producing useful 
reports, without the need for writing 
computer code. Gene World 2.1 runs 
on a variety of platforms and operat- 
ing systems. 

Pairing industrial relational database-management systems with a web-browser interface, Pangea's Operating System of Drug Discovery™ is an open-computing framework that allows client/server and Java-enabled web-based technologies to collect, organize and analyze drug discovery information for pharmaceutical companies to simplify and accelerate drug discovery. The technology unites automated genomics database analysis for drug target site selection, chemical information database analysis, large-scale combinatorial chemistry project management, and high-throughput screening project management for drug lead efficacy analysis. Pangea officials maintain that these integrated elements provide a unified environment for chemists, biologists and others involved in the drug discovery process to work together with commercial and public domain software.

Pangea's Operating System of Drug Discovery can accommodate Sybase, Oracle or Informix relational database-management systems and any version of UNIX. It absorbs new data formats, databases, algorithms and analysis paradigms into the automated workflow without software modifications. Netscape Navigator provides a friendly user interface from PC, Macintosh, and UNIX workstations.

In the near term, Pangea plans to 
complete its bioinformatics core 
with two more programs. Gene 
Foundry, a sample tracking and 
workflow sequence package for 
DNA sequence and fragment infor- 
mation, will also offer interaction 
with robots, reagent tracking and 
troubleshooting. Gene Thesaurus,
the other package, is a "warehouse
of bioinformatics data," says
Bellenson.

GTAC Chairman, Professor Norman C. Nevin, said 1996 saw "four important developments": an increase in enquiries and submissions made to GTAC; an increase in the complexity of submitted protocols; a continuing shift from gene therapy for single-gene disorders toward strategies aimed at tumour destruction in cancer; and a growth in international sponsorship of U.K. gene therapy trials.

Since 1993, GTAC and its predecessor, the Clothier Committee, have approved 18 U.K. gene therapy clinical trials (13 of which have been carried out), which are listed in the report. The disease areas targeted by these trials include severe combined immunodeficiency (1 trial), cystic fibrosis (6), metastatic melanoma (2), lymphoma (2), neuroblastoma (1), breast cancer (1), Hurler's syndrome (1), cervical cancer (1), glioblastoma



breast cancer, breast cancer with liver 
metastases, glioblastoma, malignant 
ascites due to gastrointestinal cancer 
and ovarian cancer. 

Copies of the GTAC third annual report are available from the GTAC Secretariat, Wellington House, 133-155 Waterloo Road, London SE1 8UG, UK.

Coated Lenses Prevent PCO 

Scientists in the U.K. say it may be
possible to prevent posterior capsule
opacification (PCO), a common
complication following cataract
surgery, by using the implanted polymethylmethacrylate
(PMMA) intraocular lens as a drug delivery
system. PCO occurs in 30-50% of
cataract surgery patients as a result of 
stimulated cell growth within the 
remaining capsular bag. The condi- 
tion causes a decline in visual acuity 
and requires expensive laser treat- 
ment, thus negating the routine use of 
cataract surgery in underdeveloped 
countries, explains G. Duncan, at the

Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale

Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown*

DNA microarrays containing virtually every gene of Saccharomyces cerevisiae were used 
to carry out a comprehensive investigation of the temporal program of gene expression 
accompanying the metabolic shift from fermentation to respiration. The expression 
profiles observed for genes with known metabolic functions pointed to features of the 
metabolic reprogramming that occur during the diauxic shift, and the expression patterns 
of many previously uncharacterized genes provided clues to their possible functions. The 
same DNA microarrays were also used to identify genes whose expression was affected 
by deletion of the transcriptional co-repressor TUP1 or overexpression of the transcrip- 
tional activator YAP1. These results demonstrate the feasibility and utility of this ap- 
proach to genomewide exploration of gene expression patterns. 



The complete sequences of nearly a dozen
microbial genomes are known, and in the
next several years we expect to know the 
complete genome sequences of several 
metazoans, including the human genome. 
Defining the role of each gene in these 
genomes will be a formidable task, and un- 
derstanding how the genome functions as a 
whole in the complex natural history of a 
living organism presents an even greater 
challenge. 

Knowing when and where a gene is 
expressed often provides a strong clue as to 
its biological role. Conversely, the pattern 
of genes expressed in a cell can provide 
detailed information about its state. Al- 
though regulation of protein abundance in 
a cell is by no means accomplished solely 
by regulation of mRNA, virtually all dif- 
ferences in cell type or state are correlated 
with changes in the mRNA levels of many 
genes. This is fortuitous because the only 
specific reagent required to measure the 
abundance of the mRNA for a specific 
gene is a cDNA sequence. DNA microar- 
rays, consisting of thousands of individual 
gene sequences printed in a high-density 
array on a glass microscope slide (1, 2),
provide a practical and economical tool 
for studying gene expression on a very 
large scale (3-6). 

Saccharomyces cerevisiae is an especially favorable organism in which to conduct a systematic investigation of gene expression. The genes are easy to recognize in the genome sequence, cis regulatory elements are generally compact and close to the transcription units, much is already known about its genetic regulatory mechanisms, and a powerful set of tools is available for its analysis.

Department of Biochemistry, Stanford University School of Medicine, Howard Hughes Medical Institute, Stanford, CA 94305-5428, USA.

*To whom correspondence should be addressed. E-mail: pbrown@cmgm.stanford.edu

A recurring cycle in the natural history 
of yeast involves a shift from anaerobic 
(fermentation) to aerobic (respiration) me- 
tabolism. Inoculation of yeast into a medi- 
um rich in sugar is followed by rapid growth 
fueled by fermentation, with the production 
of ethanol. When the fermentable sugar is
exhausted, the yeast cells turn to ethanol as 
a carbon source for aerobic growth. This 
switch from anaerobic growth to aerobic 
respiration upon depletion of glucose, re- 
ferred to as the diauxic shift, is correlated 
with widespread changes in the expression 
of genes involved in fundamental cellular 
processes such as carbon metabolism, pro- 
tein synthesis, and carbohydrate storage 
(7). We used DNA microarrays to charac- 
terize the changes in gene expression that 
take place during this process for nearly the 
entire genome, and to investigate the ge- 
netic circuitry that regulates and executes 
this program. 

Yeast open reading frames (ORFs) were 
amplified by the polymerase chain reaction 
(PCR), with a commercially available set of 
primer pairs (8). DNA microarrays, con- 
taining approximately 6400 distinct DNA 
sequences, were printed onto glass slides by 



using a simple robotic printing device (9). 
Cells from an exponentially growing culture 
of yeast were inoculated into fresh medium 
and grown at 30°C for 21 hours. After an 
initial 9 hours of growth, samples were har- 
vested at seven successive 2-hour intervals, 
and mRNA was isolated (10). Fluorescently
labeled cDNA was prepared by reverse tran-
scription in the presence of Cy3(green)-
or Cy5(red)-labeled deoxyuridine triphos-
phate (dUTP) (11) and then hybridized to
the microarrays (12). To maximize the re-
liability with which changes in expression
levels could be discerned, we labeled cDNA
prepared from cells at each successive time 
point with Cy5, then mixed it with a Cy3- 
labeled "reference" cDNA sample prepared 
from cells harvested at the first interval 
after inoculation. In this experimental de- 
sign, the relative fluorescence intensity 
measured for the Cy3 and Cy5 fluors at 
each array element provides a reliable mea- 
sure of the relative abundance of the corre- 
sponding mRNA in the two cell popula- 
tions (Fig. 1). Data from the series of seven 
samples (Fig. 2), consisting of more than
43,000 expression-ratio measurements,
were organized into a database to facilitate 
efficient exploration and analysis of the 
results. This database is publicly available 
on the Internet (13). 
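As a rough illustration of the two-color design described above, the sketch below turns the Cy5 (time-point sample) and Cy3 (reference sample) intensities at one array element into a log2 expression ratio. The background subtraction step, the function name and the spot values are assumptions made for the example; the text does not specify how the authors processed raw intensities. Note that seven samples on arrays of about 6400 elements give at most about 44,800 ratios, which is consistent with the "more than 43,000" measurements reported.

```python
import math

def log2_ratio(cy5, cy3, background=0.0):
    """Return log2(Cy5 / Cy3) after an assumed flat background subtraction.

    Returns None when either channel is not above background, since the
    ratio is then undefined or unreliable.
    """
    red = cy5 - background
    green = cy3 - background
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)

# Hypothetical (Cy5, Cy3) intensities for three array elements.
spots = {
    "ORF_1": (800.0, 200.0),   # fourfold more mRNA than in the reference
    "ORF_2": (150.0, 600.0),   # fourfold less mRNA than in the reference
    "ORF_3": (50.0, 0.0),      # no usable reference signal
}

ratios = {orf: log2_ratio(cy5, cy3) for orf, (cy5, cy3) in spots.items()}
print(ratios)
```

Working on a log2 scale makes induction and repression symmetric: a fourfold increase maps to +2 and a fourfold decrease to -2.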

During exponential growth in glucose- 
rich medium, the global pattern of gene 
expression was remarkably stable. Indeed, 
when gene expression patterns between the 
first two cell samples (harvested at a 2-hour 
interval) were compared, mRNA levels dif- 
fered by a factor of 2 or more for only 19 
genes (0.3%), and the largest of these dif- 
ferences was only 2.7-fold (14). However, as 
glucose was progressively depleted from the 
growth media during the course of the ex- 
periment, a marked change was seen in the 
global pattern of gene expression. mRNA 
levels for approximately 710 genes were 
induced by a factor of at least 2, and the 
mRNA levels for approximately 1030 genes 
declined by a factor of at least 2. Messenger 
RNA levels for 183 genes increased by a 
factor of at least 4, and mRNA levels for 
203 genes diminished by a factor of at least 
4. About half of these differentially ex- 
pressed genes have no currently recognized 
function and are not yet named. Indeed,
more than 400 of the differentially ex-
pressed genes have no apparent homology
to any gene whose function is known (15).
The responses of these previously unchar-
acterized genes to the diauxic shift therefore
provide the first small clue to their possible
roles.
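The gene counts quoted above (induced or repressed by a factor of at least 2, or at least 4) amount to applying a fold-change threshold to each gene's expression ratio. The sketch below shows one way such a tally could be made; the function names and the five example ratios are hypothetical, and the ratios are assumed to be sample-over-reference values on a linear scale.

```python
def classify_fold_change(ratio, threshold=2.0):
    """Label a linear expression ratio as induced, repressed or unchanged."""
    if ratio >= threshold:
        return "induced"
    if ratio <= 1.0 / threshold:
        return "repressed"
    return "unchanged"

def count_classes(ratios, threshold=2.0):
    """Count how many genes fall into each class at the given threshold."""
    counts = {"induced": 0, "repressed": 0, "unchanged": 0}
    for ratio in ratios.values():
        counts[classify_fold_change(ratio, threshold)] += 1
    return counts

# Hypothetical expression ratios (time point / reference) for five genes.
example_ratios = {"g1": 4.5, "g2": 2.0, "g3": 1.1, "g4": 0.5, "g5": 0.2}

twofold = count_classes(example_ratios, threshold=2.0)
fourfold = count_classes(example_ratios, threshold=4.0)
print(twofold, fourfold)
```

Raising the threshold from 2 to 4 shrinks both the induced and the repressed sets, the same pattern seen in the reported counts (about 710 and 1030 genes at twofold versus 183 and 203 at fourfold).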

The global view of changes in expres- 
sion of genes with known functions pro- 
vides a vivid picture of the way in which 
the cell adapts to a changing environ- 
ment. Figure 3 shows a portion of the yeast 
metabolic pathways involved in carbon 
and energy metabolism. Mapping the 
changes we observed in the mRNAs en- 
coding each enzyme onto this framework 
allowed us to infer the redirection in the 
flow of metabolites through this system. 
We observed large inductions of the genes 
coding for the enzymes aldehyde dehydro-
genase (ALD2) and acetyl-coenzyme
A (CoA) synthase (ACS1), which func-
tion together to convert the products of
alcohol dehydrogenase into acetyl-CoA, 
which in turn is used to fuel the tricarbox- 
ylic acid (TCA) cycle and the glyoxylate 
cycle. The concomitant shutdown of tran- 
scription of the genes encoding pyruvate 
decarboxylase and induction of pyruvate 
carboxylase rechannels pyruvate away 
from acetaldehyde, and instead to oxalac- 
etate, where it can serve to supply the 
TCA cycle and gluconeogenesis. Induc-
tion of the pivotal genes PCK1, encoding
phosphoenolpyruvate carboxykinase, and
FBP1, encoding fructose 1,6-biphos-
phatase, switches the directions of two key
irreversible steps in glycolysis, reversing
the flow of metabolites along the revers-
ible steps of the glycolytic pathway toward
the essential biosynthetic precursor, glu-
cose-6-phosphate. Induction of the genes
coding for the trehalose synthase and gly- 
cogen synthase complexes promotes chan- 
neling of glucose-6-phosphate into these 
carbohydrate storage pathways. 

Just as the changes in expression of 
genes encoding pivotal enzymes can pro- 
vide insight into metabolic reprogram- 
ming, the behavior of large groups of func- 
tionally related genes can provide a broad 
view of the systematic way in which the 
yeast cell adapts to a changing environ- 
ment (Fig. 4). Several classes of genes, 
such as cytochrome c-related genes and 
those involved in the TCA/glyoxylate cy-
cle and carbohydrate storage, were coordi-
nately induced by glucose exhaustion. In
contrast, genes devoted to protein synthe- 
sis, including ribosomal proteins, tRNA 
synthetases, and translation, elongation, 
and initiation factors, exhibited a coordi- 
nated decrease in expression. More than 
95% of ribosomal genes showed at least 
twofold decreases in expression during the 
diauxic shift (Fig. 4) (13). A noteworthy 
and illuminating exception was that the
genes encoding mitochondrial ribosomal
proteins were generally induced rather than
repressed after glucose limitation,
highlighting the requirement for mitochondrial
biogenesis (13). As more is learned about
the functions of every gene in the yeast 
genome, the ability to gain insight into a 
cell's response to a changing environment 
through its global gene expression patterns 
will become increasingly powerful. 

Several distinct temporal patterns of ex- 
pression could be recognized, and sets of 
genes could be grouped on the basis of the 
similarities in their expression patterns. The 
characterized members of each of these 
groups also shared important similarities in 
their functions. Moreover, in most cases, 
common regulatory mechanisms could be 
inferred for sets of genes with similar expres- 
sion profiles. For example, seven genes 
showed a late induction profile, with mRNA 
levels increasing by more than ninefold at
the last timepoint but less than threefold at
the preceding timepoint (Fig. 5B). All of 
these genes were known to be glucose-re- 
pressed, and five of the seven were previously 
noted to share a common upstream activating
sequence (UAS), the carbon source response
element (CSRE) (16-20). A search
in the promoter regions of the remaining two
genes, ACR1 and IDP2, revealed that
ACR1, a gene essential for ACS1 activity,
also possessed a consensus CSRE motif, but
interestingly, IDP2 did not. A search of the 
entire yeast genome sequence for the con- 
sensus CSRE motif revealed only four addi- 
tional candidate genes, none of which 
showed a similar induction. 
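
The late-induction criterion described above (more than ninefold at the last timepoint, less than threefold at the preceding one) can be written as a simple filter. The following is a minimal sketch, assuming each gene's expression ratios are stored as a list ordered by timepoint; the function name, the data layout, and the example numbers are illustrative assumptions, not the authors' software or data.

```python
from typing import Dict, List


def late_induced(ratios: Dict[str, List[float]],
                 last_min: float = 9.0,
                 prev_max: float = 3.0) -> List[str]:
    """Return genes whose final ratio exceeds last_min while the
    ratio at the preceding timepoint stays below prev_max."""
    hits = []
    for gene, series in ratios.items():
        if len(series) < 2:
            continue  # need at least two timepoints to compare
        if series[-1] > last_min and series[-2] < prev_max:
            hits.append(gene)
    return sorted(hits)


# Hypothetical fold-change series (made-up values for illustration only).
example = {
    "geneA": [1.0, 1.1, 1.3, 2.5, 12.0],  # late induction
    "geneB": [1.0, 2.0, 4.0, 6.0, 10.0],  # gradual induction
    "geneC": [1.0, 0.9, 0.5, 0.3, 0.2],   # repression
}
print(late_induced(example))
```

Only the first hypothetical gene passes both thresholds; the gradually induced gene is excluded because it already exceeds threefold at the penultimate timepoint.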

Examples from additional groups of 
genes that shared expression profiles are 
illustrated in Fig. 5, C through F. The 
sequences upstream of the named genes in 
Fig. 5C all contain stress response
elements (STRE), and with the exception

Fig. 1. Yeast genome microarray. The actual size of the microarray is 18 mm by 18 mm. The
microarray was printed as described (9). This image was obtained with the same fluorescent
scanning confocal microscope used to collect all the data we report (49). A fluorescently labeled
cDNA probe was prepared from mRNA isolated from cells harvested shortly after inoculation (culture
density of <5 x 10^6 cells/ml and media glucose level of 19 g/liter) by reverse transcription in the
presence of Cy3-dUTP. Similarly, a second probe was prepared from mRNA isolated from cells taken
from the same culture 9.5 hours later (culture density of ~2 x 10^8 cells/ml, with a glucose level of
<0.2 g/liter) by reverse transcription in the presence of Cy5-dUTP. In this image, hybridization of the
Cy3-dUTP-labeled cDNA (that is, mRNA expression at the initial timepoint) is represented as a green
signal, and hybridization of Cy5-dUTP-labeled cDNA (that is, mRNA expression at 9.5 hours) is
represented as a red signal. Thus, genes induced or repressed after the diauxic shift appear in this
image as red and green spots, respectively. Genes expressed at roughly equal levels before and after
the diauxic shift appear in this image as yellow spots.

of HSP42, have previously been shown to
be controlled at least in part by these 
elements (21-24). Inspection of the se- 
quences upstream of HSP42 and the two 
uncharacterized genes shown in Fig. 5C, 
YKL026c, a hypothetical protein with 
similarity to glutathione peroxidase, and 
YGR043c, a putative transaldolase,
revealed that each of these genes also
possesses repeated upstream copies of the
stress-responsive CCCCT motif. Of the 13
additional genes in the yeast genome that
shared this expression profile [including 
HSP30, ALD2, OM45, and 10 uncharac- 
terized ORFs (25)], nine contained one or 
more recognizable STRE sites in their up- 
stream regions. 

The heterotrimeric transcriptional activator
complex HAP2,3,4 has been shown
to be responsible for induction of several 
genes important for respiration (26-28). 
This complex binds a degenerate consensus 
sequence known as the CCAAT box (26). 
Computer analysis, using the consensus se- 
quence TNRYTGGB (29), has suggested 
that a large number of genes involved in 
respiration may be specific targets of 
HAP2,3,4 (30). Indeed, a putative 
HAP2,3,4 binding site could be found in 
the sequences upstream of each of the seven 
cytochrome c-related genes that showed 
the greatest magnitude of induction (Fig. 
5D). Of 12 additional cytochrome c-related 
genes that were induced, HAP2,3,4 binding
sites were present in all but one. Signifi- 
cantly, we found that transcription of 
HAP4 itself was induced nearly ninefold 
concomitant with the diauxic shift. 
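
A degenerate consensus such as TNRYTGGB can be searched for by expanding each IUPAC ambiguity code into a character class. The sketch below is one way to do this, assuming both strands of the upstream sequence are scanned (the text does not say how the published search handled strands); the function names and the test sequence are hypothetical.

```python
import re

# Standard IUPAC nucleotide ambiguity codes expressed as regex classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(seq, consensus):
    """Return sorted (position, strand) hits for an IUPAC consensus.

    position is the 0-based index, on the given strand, of the leftmost
    base covered by the match. The lookahead allows overlapping matches.
    """
    pattern = re.compile("(?=(" + "".join(IUPAC[c] for c in consensus) + "))")
    seq = seq.upper()
    n, k = len(seq), len(consensus)
    hits = [(m.start(), "+") for m in pattern.finditer(seq)]
    rc = reverse_complement(seq)
    hits += [(n - m.start() - k, "-") for m in pattern.finditer(rc)]
    return sorted(hits)


# Hypothetical upstream fragment containing one instance, TCATTGGC.
print(find_motif("AAGTCATTGGCTT", "TNRYTGGB"))
```

The instance in the example contains ATTGG, the reverse complement of CCAAT, which is why a match to this consensus corresponds to a CCAAT box on the opposite strand.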

Control of ribosomal protein biogenesis 
is mainly exerted at the transcriptional 
level, through the presence of a common 
upstream-activating element (UASrpg)
that is recognized by the Rap1 DNA-binding
protein (31, 32). The expression profiles
of seven ribosomal proteins are shown
in Fig. 5F. A search of the sequences
upstream of all seven genes revealed
consensus Rap1-binding motifs (33). It has
been suggested that declining Rap1 levels
in the cell during starvation may be
responsible for the decline in ribosomal
protein gene expression (34). Indeed, we
observed that the abundance of RAP1
mRNA diminished by 4.4-fold, at about
the time of glucose exhaustion. 

Of the 149 genes that encode known or 
putative transcription factors, only two, 
HAP4 and SIP4, were induced by a factor of
more than threefold at the diauxic shift.
SIP4 encodes a DNA-binding transcriptional
activator that has been shown to
interact with Snf1, the "master regulator" of
glucose repression (35). The eightfold
induction of SIP4 upon depletion of glucose
strongly suggests a role in the induction of
downstream genes at the diauxic shift.

Although most of the transcriptional 
responses that we observed were not pre- 
viously known, the responses of many 
genes during the diauxic shift have been 
described. Comparison of the results we 
obtained by DNA microarray hybridiza- 
tion with previously reported results there- 
fore provided a strong test of the sensitiv- 
ity and accuracy of this approach. The 
expression patterns we observed for previ- 
ously characterized genes showed almost 
perfect concordance with previously pub- 
lished results (36). Moreover, the differ- 
ential expression measurements obtained 
by DNA microarray hybridization were re- 
producible in duplicate experiments. For 
example, the remarkable changes in gene 
expression between cells harvested imme- 
diately after inoculation and immediately 
after the diauxic shift (the first and sixth 
intervals in this time series) were mea- 
sured in duplicate, independent DNA mi- 
croarray hybridizations. The correlation 
coefficient for two complete sets of expres- 
sion ratio measurements was 0.87, and for 
more than 95% of the genes, the expression
ratios measured in these duplicate
experiments differed by less than a factor 
of 2. However, in a few cases, there were 
discrepancies between our results and pre- 
vious results, pointing to technical limita- 
tions that will need to be addressed as 
DNA microarray technology advances 
(37, 38). Despite the noted exceptions, 
the high concordance between the results 
we obtained in these experiments and 
those of previous studies provides confi- 
dence in the reliability and thoroughness 
of the survey. 
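
The two reproducibility measures quoted above, a correlation coefficient between duplicate sets of expression ratios and the fraction of genes whose duplicate ratios differ by less than a factor of 2, can be computed as follows. This is a minimal sketch: the text does not say whether the correlation was taken on raw or log-transformed ratios, so the use of log2 here is an assumption, and the example values are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fraction_within_factor(r1, r2, factor=2.0):
    """Fraction of genes whose two ratios differ by less than `factor`."""
    limit = math.log(factor)
    close = sum(1 for a, b in zip(r1, r2) if abs(math.log(a / b)) < limit)
    return close / len(r1)


# Hypothetical duplicate measurements for five genes.
rep1 = [1.0, 2.0, 4.0, 0.5, 8.0]
rep2 = [1.1, 1.8, 9.0, 0.4, 7.0]

r = pearson([math.log2(v) for v in rep1], [math.log2(v) for v in rep2])
print(round(r, 3), fraction_within_factor(rep1, rep2))
```

Working on a log scale treats a twofold induction and a twofold repression symmetrically, which is one reason it is a common choice for ratio data.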

The changes in gene expression during 
this diauxic shift are complex and involve 
integration of many kinds of information 
about the nutritional and metabolic state 
of the cell. The large number of genes 
whose expression is altered and the diver- 
sity of temporal expression profiles ob- 
served in this experiment highlight the 
challenge of understanding the underlying 
regulatory mechanisms. One approach to 
defining the contributions of individual 
regulatory genes to a complex program of 
this kind is to use DNA microarrays to 
identify genes whose expression is affected 



Fig. 2. The section of the array indicated by the gray box in Fig. 1 is shown for each of the
experiments described here. Representative genes are labeled. In each of the arrays used to analyze
gene expression during the diauxic shift, red spots represent genes that were induced relative to
the initial timepoint, and green spots represent genes that were repressed relative to the initial
timepoint. In the arrays used to analyze the effects of the tup1Δ mutation and YAP1 overexpression,
red spots represent genes whose expression was increased, and green spots represent genes whose
expression was decreased by the genetic modification. Note that distinct sets of genes are induced
and repressed in the different experiments. The complete images of each of these arrays can be
viewed on the Internet (13). Cell density as measured by optical density (OD) at 600 nm was used to
measure the growth of the culture.

by mutations in each putative regulatory
gene. As a test of this strategy, we analyzed
the genomewide changes in gene expression
that result from deletion of the TUP1 gene.
Transcriptional repression of many genes by 
glucose requires the DNA-binding repressor
Mig1 and is mediated by recruiting the
transcriptional co-repressors Tup1 and Cyc8/
Ssn6 (39). Tup1 has also been implicated in
repression of oxygen-regulated, mating-type- 
specific, and DNA-damage-inducible genes 
(40). 



[Fig. 3 diagram not reproduced; surviving labels: Debranching; Glycolysis/gluconeogenesis]

Fig. 3. Metabolic reprogramming inferred from global analysis of changes in gene expression. Only key 
metabolic intermediates are identified. The yeast genes encoding the enzymes that catalyze each step 
in this metabolic circuit are identified by name in the boxes. The genes encoding succinyl-CoA synthase 
and glycogen-debranching enzyme have not been explicitly identified, but the ORFs YGR244 and 
YPR184 show significant homology to known succinyl-CoA synthase and glycogen-debranching en- 
zymes, respectively, and are therefore included in the corresponding steps in this figure. Red boxes with 
white lettering identify genes whose expression increases in the diauxic shift. Green boxes with dark 
green lettering identify genes whose expression diminishes in the diauxic shift. The magnitude of 
induction or repression is indicated for these genes. For multimeric enzyme complexes, such as 
succinate dehydrogenase, the indicated fold-induction represents an unweighted average of all the
genes listed in the box. Black and white boxes indicate no significant differential expression (less than 
twofold). The direction of the arrows connecting reversible enzymatic steps indicate the direction of the 
flow of metabolic intermediates, inferred from the gene expression pattern, after the diauxic shift. Arrows 
representing steps catalyzed by genes whose expression was strongly induced are highlighted in red. 
The broad gray arrows represent major increases in the flow of metabolites after the diauxic shift, 
inferred from the indicated changes in gene expression. 



Wild-type yeast cells and cells bearing 
a deletion of the TUP1 gene (tup1Δ) were
grown in parallel cultures in rich medium 
containing glucose as the carbon source. 
Messenger RNA was isolated from expo- 
nentially growing cells from the two pop- 
ulations and used to prepare cDNA la- 
beled with Cy3 (green) and Cy5 (red), 
respectively (11). The labeled probes were
mixed and simultaneously hybridized to 
the microarray. Red spots on the microar- 
ray therefore represented genes whose 
transcription was induced in the tup1Δ
strain, and thus presumably repressed by
Tup1 (41). A representative section of the
microarray (Fig. 2, bottom middle panel) 
illustrates that the genes whose expression 
was affected by the tup1Δ mutation were,
in general, distinct from those induced 
upon glucose exhaustion [complete images 
of all the arrays shown in Fig. 2 are avail- 
able on the Internet (13)]. Nevertheless, 
34 (10%) of the genes that were induced 
by a factor of at least 2 after the diauxic 
shift were similarly induced by deletion of 
TUP1, suggesting that these genes may be
subject to TUP1-mediated repression by
glucose. For example, SUC2, the gene en- 
coding invertase, and all five hexose trans- 
porter genes that were induced during the 
course of the diauxic shift were similarly 
induced, in duplicate experiments, by the 
deletion of TUP1.

The set of genes affected by Tup1 in this
experiment also included a-glucosidases, 
the mating-type-specific genes MFA1 and 
MFA2, and the DNA damage-inducible 
RNR2 and RNR4, as well as genes involved
in flocculation and many genes of unknown 
function. The hybridization signal
corresponding to expression of TUP1 itself was
also severely reduced because of the
(incomplete) deletion of the transcription unit
in the tup1Δ strain, providing a positive
control in the experiment (42). 

Many of the transcriptional targets of 
Tup1 fell into sets of genes with related
biochemical functions. For instance, al- 
though only about 3% of all yeast genes 
appeared to be TUP1-repressed by a factor
of more than 2 in duplicate experiments 
under these conditions, 6 of the 13 genes 
that have been implicated in flocculation 
(15) showed a reproducible increase in 
expression of at least twofold when TUP1
was deleted. Another group of related
genes that appeared to be subject to TUP1
repression encodes the serine-rich cell
wall mannoproteins, such as Tip1 and
Tir1/Srp1, which are induced by cold
shock and other stresses (43), and similar, 
serine-poor proteins, the seripauperins 
(44). Messenger RNA levels for 23 of the 
26 genes in this group were reproducibly 
elevated by at least 2.5-fold in the tup1Δ
strain, and 18 of these genes were induced
by more than sevenfold when TUP1 was
deleted. In contrast, none of 83 genes that 
could be classified as putative regulators of 
the cell division cycle were induced more 
than twofold by deletion of TUP1. Thus,
despite the diversity of the regulatory
systems that employ Tup1, most of the genes
that it regulates under these conditions 
fall into a limited number of distinct func- 
tional classes. 

Because the microarray allows us to 
monitor expression of nearly every gene in 
yeast, we can, in principle, use this ap- 
proach to identify all the transcriptional 
targets of a regulatory protein like Tup1. It
is important to note, however, that in any 
single experiment of this kind we can only 
recognize those target genes that are nor- 
mally repressed (or induced) under the 
conditions of the experiment. For instance,
the experiment described here analyzed
a MATα strain in which MFA1
and MFA2, the genes encoding the a-factor
mating pheromone precursor, are
normally repressed. In the isogenic tup1Δ
strain, these genes were inappropriately
expressed, reflecting the role that Tup1
plays in their repression. Had we instead
carried out this experiment with a MATa
strain (in which expression of MFA1 and
MFA2 is not repressed), it would not have
been possible to conclude anything
regarding the role of Tup1 in the repression
of these genes. Conversely, we cannot
distinguish indirect effects of the chronic
absence of Tup1 in the mutant strain from
effects directly attributable to its partici- 
pation in repressing the transcription of a 
gene. 

Another simple route to modulating the 
activity of a regulatory factor is to
overexpress the gene that encodes it. YAP1
encodes a DNA-binding transcription factor
belonging to the b-zip class of DNA-binding
proteins. Overexpression of YAP1 in
yeast confers increased resistance to
hydrogen peroxide, o-phenanthroline, heavy
metals, and osmotic stress (45). We
analyzed differential gene expression between a
wild-type strain bearing a control plasmid
and a strain with a plasmid expressing YAP1
under the control of the strong GAL1-10
promoter, both grown in galactose (that is,
a condition that induces YAP1
overexpression). Complementary DNA from the
control and YAP1 overexpressing strains,
labeled with Cy3 and Cy5, respectively, was
prepared from mRNA isolated from the two
strains and hybridized to the microarray.
Thus, red spots on the array represent genes
that were induced in the strain
overexpressing YAP1.

Of the 17 genes whose mRNA levels 
increased by more than threefold when
YAP1 was overexpressed in this way, five
bear homology to aryl-alcohol oxidoreduc- 
tases (Fig. 2 and Table 1). An additional 
four of the genes in this set also belong to 
the general class of dehydrogenases/oxi- 
doreductases. Very little is known about 
the role of aryl-alcohol oxidoreductases in 
S. cerevisiae, but these enzymes have been 
isolated from ligninolytic fungi, in which 
they participate in coupled redox reac- 
tions, oxidizing aromatic, and aliphatic 
unsaturated alcohols to aldehydes with the 
production of hydrogen peroxide (46, 47). 
The fact that a remarkable fraction of the 
targets identified in this experiment be- 
long to the same small, functional group of 
oxidoreductases suggests that these genes 

Fig. 4. Coordinated regulation of functionally related genes. The curves represent the average
induction or repression ratios for all the genes in each indicated group. The total number of
genes in each group was as follows: ribosomal proteins, 112; translation elongation and initiation
factors, 25; tRNA synthetases (excluding mitochondrial synthetases), 17; glycogen and trehalose
synthesis and degradation, 15; cytochrome c oxidase and reductase proteins, 19; and TCA- and
glyoxylate-cycle enzymes, 24.

Table 1. Genes induced by YAP1 overexpression. This list includes all the genes for which mRNA levels
increased by more than twofold upon YAP1 overexpression in both of two duplicate experiments, and 
for which the average increase in mRNA level in the two experiments was greater than threefold (50). 
Positions of the canonical Yap1 binding sites upstream of the start codon, when present, and the 
average fold-increase in mRNA levels measured in the two experiments are indicated. 



might play an important protective role 
during oxidative stress. Transcription of a 
small number of genes was reduced in the 
strain overexpressing Yap1. Interestingly,
many of these genes encode sugar per- 
meases or enzymes involved in inositol 
metabolism. 

We searched for Yap1-binding sites
(TTACTAA or TGACTAA) in the sequences
upstream of the target genes we
identified (48). About two-thirds of the
genes that were induced by more than
threefold upon Yap1 overexpression had
one or more binding sites within 600 bases
upstream of the start codon (Table 1),
suggesting that they are directly regulated by
Yap1. The absence of canonical Yap1-binding
sites upstream of the others may reflect
an ability of Yap1 to bind sites that differ
from the canonical binding sites, perhaps in
cooperation with other factors, or, less
likely, may represent an indirect effect of Yap1
overexpression, mediated by one or more
intermediary factors. Yap1 sites were found
only four times in the corresponding region
of an arbitrary set of 30 genes that were not
differentially regulated by Yap1.
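
The search for the two canonical sites within 600 bases upstream of the start codon can be sketched as a plain substring scan. The code below is illustrative only: it assumes the upstream sequence is supplied 5' to 3' with its last base at position -1 relative to the start codon, and it assumes both strands are scanned, which the text does not state. All names and the toy sequence are hypothetical.

```python
YAP1_SITES = ("TTACTAA", "TGACTAA")  # canonical sites quoted in the text
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def yap1_site_positions(upstream, window=600):
    """Return sorted (position, strand, site) tuples.

    position is negative: the offset of the leftmost matched base from
    the start codon, with the last upstream base at -1.
    """
    region = upstream.upper()[-window:]
    n = len(region)
    hits = []
    for site in YAP1_SITES:
        for strand, query in (("+", site), ("-", revcomp(site))):
            start = region.find(query)
            while start != -1:
                hits.append((start - n, strand, site))
                start = region.find(query, start + 1)
    return sorted(hits)


# Toy upstream fragment: one forward TTACTAA and one reverse-strand TGACTAA.
toy = "GGGTTACTAAGGGGTTAGTCAGG"
print(yap1_site_positions(toy))
```

Applying the same function to a control set of genes that did not respond would give the kind of background count the text reports for its arbitrary set of 30 genes.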

Use of a DNA microarray to character- 
ize the transcriptional consequences of 
mutations affecting the activity of
regulatory molecules provides a simple and
powerful approach to dissection and
characterization of regulatory pathways and
networks. This strategy also has an important
practical application in drug screening. 
Mutations in specific genes encoding can- 
didate drug targets can serve as surrogates 
for the ideal chemical inhibitor or modu- 
lator of their activity. DNA microarrays 
can be used to define the resulting signa- 
ture pattern of alterations in gene expres- 
sion, and then subsequently used in an 
assay to screen for compounds that repro- 
duce the desired signature pattern. 

DNA microarrays provide a simple and 
economical way to explore gene expres- 
sion patterns on a genomic scale. The 
hurdles to extending this approach to any 
other organism are minor. The equipment
required for fabricating and using DNA
microarrays (9) consists of components 
that were chosen for their modest cost and 
simplicity. It was feasible for a small group 
to accomplish the amplification of more 
than 6000 genes in about 4 months and, 
once the amplified gene sequences were in 
hand, only 2 days were required to print a 
set of 110 microarrays of 6400 elements 
each. Probe preparation, hybridization, 
and fluorescent imaging are also simple 
procedures. Even conceptually simple ex- 
periments, as we described here, can yield 
vast amounts of information. The value of 
the information from each experiment of 
this kind will progressively increase as 
more is learned about the functions of 
each gene and as additional experiments 
define the global changes in gene expres- 
sion in diverse other natural processes and 
genetic perturbations. Perhaps the greatest 
challenge now is to develop efficient 
methods for organizing, distributing, inter- 
preting, and extracting insights from the 
large volumes of data these experiments 
will provide.

ABSTRACT Pairwise sequence comparison methods have
been assessed using proteins whose relationships are known
reliably from their structures and functions, as described in
the scop database [Murzin, A. G., Brenner, S. E., Hubbard, T.
& Chothia C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation
tested the programs blast [Altschul, S. F., Gish, W.,
Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol.
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996)
Methods Enzymol. 266, 460-480], fasta [Pearson, W. R. &
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448],
and ssearch [Smith, T. F. & Waterman, M. S. (1981) J. Mol.
Biol. 147, 195-197] and their scoring schemes. The error rate
of all algorithms is greatly reduced by using statistical scores
to evaluate matches rather than percentage identity or raw
scores. The E-value statistical scores of ssearch and fasta are
reliable: the number of false positives found in our tests agrees
well with the scores reported. However, the P-values reported
by blast and wu-blast2 exaggerate significance by orders of
magnitude. ssearch, fasta ktup = 1, and wu-blast2 perform
best, and they are capable of detecting almost all relationships 
between proteins whose sequence identities are >30%. For 
more distantly related proteins, they do much less well; only 
one-half of the relationships between proteins with 20-30% 
identity are found. Because many homologs have low sequence 
similarity, most distant relationships cannot be detected by 
any pairwise comparison method; however, those which are 
identified may be used with confidence. 

Sequence database searching plays a role in virtually every 
branch of molecular biology and is crucial for interpreting the 
sequences issuing forth from genome projects. Given the 
method's central role, it is surprising that overall and relative 
capabilities of different procedures are largely unknown. It is 
difficult to verify algorithms on sample data because this 
requires large data sets of proteins whose evolutionary rela- 
tionships are known unambiguously and independently of the 
methods being evaluated. However, nearly all known ho- 
mologs have been identified by sequence analysis (the method 
to be tested). Also, it is generally very difficult to know, in the 
absence of structural data, whether two proteins that lack clear 
sequence similarity are unrelated. This has meant that al- 
though previous evaluations have helped improve sequence 
comparison, they have suffered from insufficient, imperfectly 
characterized, or artificial test data. Assessment also has been 
problematic because high quality database sequence searching 
attempts to have both sensitivity (detection of homologs) and 
specificity (rejection of unrelated proteins); however, these 
complementary goals are linked such that increasing one 
causes the other to be reduced.

Sequence comparison methodologies have evolved rapidly,
so no previously published test has evaluated modern versions
of programs commonly used. For example, parameters in 
blast (1) have changed, and WU-BLAST2 (2)— which produces 
gapped alignments— has become available. The latest version 
of fasta (3) previously tested was 1.6, but the current release 
(version 3.0) provides fundamentally different results in the 
form of statistical scoring. 

The previous reports also have left gaps in our knowledge. 
For example, there has been no published assessment of 
thresholds for scoring schemes more sophisticated than per- 
centage identity. Thus, the widely discussed statistical scoring 
measures have never actually been evaluated on large data- 
bases of real proteins. Moreover, the different scoring schemes 
commonly in use have not been compared. 

Beyond these issues, there is a more fundamental question: 
in an absolute sense, how well does pairwise sequence com- 
parison work? That is, what fraction of homologous proteins 
can be detected using modern database searching methods? 

In this work, we attempt to answer these questions and to 
overcome both of the fundamental difficulties that have hin- 
dered assessment of sequence comparison methodologies. 
First, we use the set of distant evolutionary relationships in the 
scop: Structural Classification of Proteins database (4), which 
is derived from structural and functional characteristics (5). 
The scop database provides a uniquely reliable set of ho- 
mologs, which are known independently of sequence compar- 
ison. Second, we use an assessment method that jointly mea- 
sures both sensitivity and specificity. This method allows 
straightforward comparison of different sequence searching 
procedures. Further, it can be used to aid interpretation of real 
database searches and thus provide optimal and reliable 
results. 
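
One way to measure sensitivity and specificity jointly is to rank all reported matches by score and record, at each cutoff, the fraction of true relationships recovered and the number of false matches per query. The sketch below illustrates that idea only; this section gives the abbreviation EPQ (errors per query) but not the exact definitions, so the formulas, names, and example values here are assumptions rather than the authors' procedure.

```python
def coverage_vs_epq(matches, n_true_pairs, n_queries):
    """Sweep a score cutoff from best to worst match.

    matches: iterable of (score, is_homolog), higher score = better match.
    Returns a list of (score, coverage, errors_per_query) tuples, where
    coverage is the fraction of the n_true_pairs known relationships
    found at or above that score.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    curve, found, errors = [], 0, 0
    for score, is_homolog in ranked:
        if is_homolog:
            found += 1
        else:
            errors += 1
        curve.append((score, found / n_true_pairs, errors / n_queries))
    return curve


# Hypothetical search results: 4 known homologous pairs, 2 queries.
matches = [(90, True), (75, True), (60, False), (55, True), (40, False)]
for point in coverage_vs_epq(matches, n_true_pairs=4, n_queries=2):
    print(point)
```

Plotting coverage against errors per query for each program gives a single curve per method, so methods can be compared at the same error rate instead of at arbitrary score thresholds.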

Previous Assessments of Sequence Comparison. Several 
previous studies have examined the relative performance of 
different sequence comparison methods. The most encom- 
passing analyses have been by Pearson (6, 7), who compared 
the three most commonly used programs. Of these, the Smith- 
Waterman algorithm (8) implemented in SSEARCH (3) is the 
oldest and slowest but the most rigorous. Modern heuristics 
have provided blast (1) the speed and convenience to make 
it the most popular program. Intermediate between these two 
is fasta (3), which may be run in two modes offering either 
greater speed (ktup = 2) or greater effectiveness (ktup = 1). 
Pearson also considered different parameters for each of these 
programs. 

To test the methods, Pearson selected two representative 
proteins from each of 67 protein superfamilies defined by the 
pir database (9). Each was used as a query to search the 
database, and the matched proteins were marked as being 
homologous or unrelated according to their membership of pir 

Abbreviation: EPQ, errors per query.

superfamilies. Pearson found that modern matrices and
"ln-scaling" of raw scores improve results considerably. He also
reported that the rigorous Smith-Waterman algorithm worked 
slightly better than fasta, which was in turn more effective 
than blast. 

Very large scale analyses of matrices have been performed 
(10), and Henikoff and Henikoff (11) also evaluated the 
effectiveness of blast and fasta. Their test with blast 
considered the ability to detect homologs above a predeter- 
mined score but had no penalty for methods which also 
reported large numbers of spurious matches. The Henikoffs 
searched the swiss-prot database (12) and used prosite (13) 
to define homologous families. Their results showed that the 
BLOSUM62 matrix (14) performed markedly better than the 
extrapolated PAM-series matrices (15), which previously had 
been popular. 

A crucial aspect of any assessment is the data that are used 
to test the ability of the program to find homologs. But in 
Pearson's and the Henikoffs' evaluations of sequence com- 
parison, the correct results were effectively unknown. This is 
because the superfamilies in pir and prosite are principally 
created by using the same sequence comparison methods 
which are being evaluated. Interdependence of data and
methods creates a "chicken and egg" problem, and means for 
example, that new methods would be penalized for correctly 
identifying homologs missed by older programs. For instance, 
immunoglobulin variable and constant domains are clearly 
homologous, but pir places them in different superfamilies. 
The problem is widespread: each superfamily in pir 48.00 with 
a structural homolog is itself homologous to an average of 1.6 
other pir superfamilies (16). 

To surmount these sorts of difficulties, Sander and Schnei- 
der (17) used protein structures to evaluate sequence com- 
parison. Rather than comparing different sequence compari- 
son algorithms, their work focused on determining a length- 
dependent threshold of percentage identity, above which all 
proteins would be of similar structure. A result of this analysis 
was the hssp equation; it states that proteins with 25% identity 
over 80 residues will have similar structures, whereas shorter 
alignments require higher identity. (Other studies also have 
used structures (18-20), but these focused on a small number 
of model proteins and were principally oriented toward eval- 
uating alignment accuracy rather than homology detection.) 

A general solution to the problem of scoring comes from 
statistical measures (i.e., E-values and P-values) based on the
extreme value distribution (21). Extreme value scoring was 
implemented analytically in the blast program using the 
Karlin and Altschul statistics (22, 23) and empirical ap- 
proaches have been recently added to fasta and ssearch. In 
addition to being heralded as a reliable means of recognizing 
significantly similar proteins (24, 25), the mathematical trac- 
tability of statistical scores "is a crucial feature of the blast 
algorithm" (1). The validity of this scoring procedure has been 
tested analytically and empirically (see ref. 2 and references in 
ref. 24). However, all large empirical tests used random 
sequences that may lack the subtle structure found within 
biological sequences (26, 27) and obviously do not contain any 
real homologs. Thus, although many researchers have sug- 
gested that statistical scores be used to rank matches (24, 25, 
28), there have been no large rigorous experiments on biolog- 
ical data to determine the degree to which such rankings are 
superior. 
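
For readers unfamiliar with the two statistical scores, the standard Karlin-Altschul relations are E = K m n exp(-lambda S) for the expected number of chance matches scoring at least S, and P = 1 - exp(-E) for the probability of at least one such match. The sketch below simply evaluates these formulas; the values of lambda and K used in the example are placeholders chosen for illustration, not parameters taken from any program discussed here.

```python
import math


def e_value(score, m, n, lam, k):
    """Expected number of chance local alignments scoring >= score
    for a query of length m searched against n database residues."""
    return k * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance match, given E: 1 - exp(-E)."""
    return -math.expm1(-e)


# Placeholder parameters (assumed, for illustration only).
e = e_value(score=60, m=250, n=500_000, lam=0.3, k=0.1)
print(e, p_value(e))
```

For small E the two scores are nearly equal, and they diverge only as E approaches or exceeds 1, where P saturates at 1.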

A Database for Testing Homology Detection. Since the 
discovery that the structures of hemoglobin and myoglobin are 
very similar though their sequences are not (29), it has been 
apparent that comparing structures is a more powerful (if less 
convenient) way to recognize distant evolutionary relation- 
ships than comparing sequences. If two proteins show a high 
degree of similarity in their structural details and function it
is very probable that they have an evolutionary relationship
though their sequence similarity may be low. 

The recent growth of protein structure information com- 
bined with the comprehensive evolutionary classification in 
the scop database (4, 5) has allowed us to overcome previous
limitations. With these data, we can evaluate the performance 
of sequence comparison methods on real protein sequences 
whose relationships are known confidently. The SCOP database 
uses structural information to recognize distant homologs, the 
large majority of which can be determined unambiguously. 
These superfamilies, such as the globins or the immunoglobu- 
lins, would be recognized as related by the vast majority of the 
biological community despite the lack of high sequence sim- 
ilarity. 

From SCOP, we extracted the sequences of domains of 
proteins in the Protein Data Bank (pdb) (30) and created two 
databases. One (PDB90D-B) has domains that are all <90%
identical to any other, whereas the other (PDB40D-B) has those <40%
identical. The databases were created by first sorting all
protein domains in SCOP by their quality and making a list. The 
highest quality domain was selected for inclusion in the 
database and removed from the list. Also removed from the list 
(and discarded) were all other domains above the threshold 
level of identity to the selected domain. This process was 
repeated until the list was empty. The PDB40D-B database 
contains 1,323 domains, which have 9,044 ordered pairs of 
distant relationships, or ≈0.5% of the total 1,749,006 ordered
pairs. In PDB90D-B, the 2,079 domains have 53,988 relation- 
ships, representing 1.2% of all pairs. Low complexity regions 
of sequence can achieve spurious high scores, so these were 
masked in both databases by processing with the SEG program 
(27) using recommended parameters: 12 1.8 2.0. The databases 
used in this paper are available from http://sss.stanford.edu/ 
sss/, and databases derived from the current version of SCOP 
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/. 
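
The selection procedure described above can be sketched in Python. This is only an illustrative sketch: the quality values, the `identity` callback, and the treatment of a pair lying exactly at the threshold are assumptions, not details given in the text.

```python
from typing import Callable, List, Tuple


def select_nonredundant(
    domains: List[Tuple[str, float]],
    identity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Greedy selection of a non-redundant domain set.

    domains   -- (name, quality) pairs; higher quality is better (assumed scale)
    identity  -- hypothetical callback giving percent identity of two domains
    threshold -- percent identity at or above which a domain is discarded
    """
    # Sort once by quality so the best remaining domain is always first.
    remaining = sorted(domains, key=lambda d: d[1], reverse=True)
    selected: List[str] = []
    while remaining:
        best_name, _ = remaining.pop(0)
        selected.append(best_name)
        # Discard every domain too similar to the one just selected.
        remaining = [d for d in remaining if identity(best_name, d[0]) < threshold]
    return selected


def ordered_pairs(n_domains: int) -> int:
    """Number of ordered (query, target) pairs, excluding self-comparisons."""
    return n_domains * (n_domains - 1)
```

With the 1,323 domains of PDB40D-B, `ordered_pairs(1323)` gives 1,749,006, which matches the total quoted above.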

Analyses from both databases were generally consistent, but 
PDB40D-B focuses on distantly related proteins and reduces the 
heavy overrepresentation in the pdb of a small number of 
families (31, 32), whereas PDB90D-B (with more sequences) 
improves evaluations of statistics. Except where noted other- 
wise, the distant homolog results here are from PDB40D-B. 
Although the precise numbers reported here are specific to the 
structural domain databases used, we expect the trends to be 
general.

Assessment Data and Procedure. Our assessment of se- 
quence comparison may be divided into four different major 
categories of tests. First, using just a single sequence compar- 
ison algorithm at a time, we evaluated the effectiveness of 
different scoring schemes. Second, we assessed the reliability 
of scoring procedures, including an evaluation of the validity 
of statistical scoring. Third, we compared sequence compari- 
son algorithms (using the optimal scoring scheme) to deter- 
mine their relative performance. Fourth, we examined the 
distribution of homologs and considered the power of pairwise 
sequence comparison to recognize them. All of the analyses 
used the databases of structurally identified homologs and a 
new assessment criterion. 

The analyses tested blast (1), version 1.4.9MP, and
wu-blast2 (2), version 2.0a13MP. Also assessed was the fasta
package, version 3.0t76 (3), which provided fasta and the
ssearch implementation of Smith-Waterman (8). For
ssearch and fasta, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix
(BLOSUM62) were used for blast and wu-blast2.

The "Coverage Vs. Error" Plot. To test a particular protocol 
(comprising a program and scoring scheme), each sequence 
from the database was used as a query to search the database. 
This yielded ordered pairs of query and target sequences with 
associated scores, which were sorted, on the basis of their 
scores, from best to worst. The ideal method would have
perfect separation, with all of the homologs at the top of the
list and unrelated proteins below. In practice, perfect separa- 
tion is impossible to achieve so instead one is interested in 
drawing a threshold above which there are the largest number 
of related pairs of sequences consistent with an acceptable 
error rate. 

Our procedure involved measuring the coverage and error 
for every threshold. Coverage was defined as the fraction of 
structurally determined homologs that have scores above the 
selected threshold; this reflects the sensitivity of a method. 
Errors per query (EPQ), an indicator of selectivity, is the 
number of nonhomologous pairs above the threshold divided 
by the number of queries. Graphs of these data, called 
coverage vs. error plots, were devised to understand how
protocols compare at different levels of accuracy. These
graphs share effectively all of the beneficial features of Re-
ceiver Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in 
sequence comparison and the huge background of nonho- 
mologs. 
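
The coverage and EPQ definitions above can be illustrated with a short Python sketch. The function names and the input layout are hypothetical. It assumes that higher scores are better (E-values would need to be negated first) and places a threshold after each pair, without special handling of tied scores.

```python
from typing import Iterable, List, Tuple


def coverage_vs_error(
    scored_pairs: Iterable[Tuple[float, bool]],
    n_homolog_pairs: int,
    n_queries: int,
) -> List[Tuple[float, float, float]]:
    """Return (threshold score, coverage, errors per query) for each threshold.

    scored_pairs    -- (score, is_homolog) for every query/target pair reported
    n_homolog_pairs -- total number of structurally determined homologous pairs
    n_queries       -- number of query sequences used
    """
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    points = []
    true_hits = 0
    errors = 0
    for score, is_homolog in ranked:
        if is_homolog:
            true_hits += 1
        else:
            errors += 1
        points.append((score, true_hits / n_homolog_pairs, errors / n_queries))
    return points


def coverage_at_epq(points: List[Tuple[float, float, float]], max_epq: float) -> float:
    """Largest coverage reachable without exceeding the given EPQ."""
    best = 0.0
    for _, coverage, epq in points:
        if epq <= max_epq:
            best = max(best, coverage)
    return best
```

Plotting the second and third elements of each returned tuple against each other gives a coverage vs. error curve for one protocol.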

This assessment procedure is directly relevant to practical 
sequence database searching, for it provides precisely the 
information necessary to perform a reliable sequence database 
search. The EPQ measure places a premium on score consis- 
tency; that is, it requires scores to be comparable for different 
queries. Consistency is an aspect which has been largely
ignored in previous tests but is essential for the straightforward
or automatic interpretation of sequence comparison results.
Further, it provides a clear indication of the confidence that
should be ascribed to each match. Indeed, the EPQ measure
should approximate the expectation value reported by data-
base searching programs, if the programs' estimates are accu-
rate.

Fig. 2. Unrelated proteins with high percentage identity. Hemo-
globin β-chain (pdb code 1hds chain b, ref. 38, Left) and cellulase E2
(pdb code 1tml, ref. 39, Right) have 39% identity over 64 residues, a
level which is often believed to be indicative of homology. Despite this
high degree of identity, their structures strongly suggest that these
proteins are not related. Appropriately, neither the raw alignment
score of 85 nor the E-value of 1.3 is significant. Proteins rendered by

Fig. 3. Length and percentage identity of alignments of unrelated
proteins in PDB90D-B: Each pair of nonhomologous proteins found with
ssearch is plotted as a point whose position indicates the length and
the percentage identity within the alignment. Because alignment
length and percentage identity are quantized, many pairs of proteins
may have exactly the same alignment length and percentage identity.
The line shows the hssp threshold (though it is intended to be applied
with a different matrix and parameters).

Fig. 4. Reliability of statistical scores in PDB90D-B: Each line shows
the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for ssearch and
fasta, whereas P-values are shown for blast and wu-blast2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers and diverge
at higher values, as indicated by the lower bold line.) E-values from
ssearch and fasta are shown to have good agreement with EPQ but
underestimate the significance slightly. blast and wu-blast2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for PDB40D-B were similar to those for PDB90D-B
despite the difference in number of homologs detected. This graph
could be used to roughly calibrate the reliability of a given statistical
score.

The Performance of Scoring Schemes. All of the programs 
tested could provide three fundamental types of scores. The 
first score is the percentage identity, which may be computed 
in several ways based on either the length of the alignment or 
the lengths of the sequences. The second is a "raw" or 
"Smith-Waterman" score, which is the measure optimized by 
the Smith-Waterman algorithm and is computed by summing 
the substitution matrix scores for each position in the align- 
ment and subtracting gap penalties. In blast, a measure
related to this score is scaled into bits. Third is a statistical
score based on the extreme value distribution. These results 
are summarized in Fig. 1. 
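
As an illustration of how a raw score is assembled from an existing alignment, the following Python sketch sums substitution scores and subtracts gap penalties. The toy substitution table in the usage note is invented for the example and is not BLOSUM. The gap convention (the first gapped position costs `gap_open`, each further position in the same gap costs `gap_extend`) is an assumption, since programs differ in how they define these penalties.

```python
from typing import Dict, Tuple


def raw_score(
    aligned_a: str,
    aligned_b: str,
    substitution: Dict[Tuple[str, str], int],
    gap_open: int = 12,
    gap_extend: int = 1,
) -> int:
    """Score a fixed pairwise alignment given as two equal-length gapped strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap_a = False  # currently inside a run of '-' in aligned_a
    in_gap_b = False  # currently inside a run of '-' in aligned_b
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if x == "-":
            score -= gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif y == "-":
            score -= gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            score += substitution[(x, y)]
            in_gap_a = in_gap_b = False
    return score
```

For example, with a toy table that scores +4 for identical residues and -2 otherwise, aligning `ACDDA` against `A--DA` gives 4 - 12 - 1 + 4 + 4 = -1.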

Sequence Identity. Though it has been long established that 
percentage identity is a poor measure (35), there is a common 
rule-of-thumb stating that 30% identity signifies homology. 
Moreover, publications have indicated that 25% identity can 
be used as a threshold (17, 36). We find that these thresholds,
originally derived years ago, are not supported by present 
results. As databases have grown, so have the possibilities for 
chance alignments with high identity; thus, the reported cutoffs 
lead to frequent errors. Fig. 2 shows one of the many pairs of 
proteins with very different structures that nonetheless have 
high levels of identity over considerable aligned regions. 
Despite the high identity, the raw and the statistical scores for 
such incorrect matches are typically not significant. The prin- 
cipal reasons percentage identity does so poorly seem to be 
that it ignores information about gaps and about the conser- 
vative or radical nature of residue substitutions. 

From the PDB90D-B analysis in Fig. 3, we learn that 30% 
identity is a reliable threshold for this database only for 
sequence alignments of at least 150 residues. Because one 
unrelated pair of proteins has 43.5% identity over 62 residues, 
it is probably necessary for alignments to be at least 70 residues 
in length before 40% is a reasonable threshold, for a database 
of this particular size and composition. 

At a given reliability, scores based on percentage identity 
detect just a fraction of the distant homologs found by 
statistical scoring. If one measures the percentage identity in 
the aligned regions without consideration of alignment length, 
then a negligible number of distant homologs are detected. 
Use of the hssp equation improves the value of percentage 
identity, but even this measure can find only 4% of all known 
homologs at 1% EPQ. In short, percentage identity discards 
most of the information measured in a sequence comparison. 

Raw Scores. Smith-Waterman raw scores perform better 
than percentage identity (Fig. 1), but ln-scaling (7) provided no
notable benefit in our analysis. It is necessary to be very precise 
when using either raw or bit scores because a 20% change in 
cutoff score could yield a tenfold difference in EPQ. However, 
it is difficult to choose appropriate thresholds because the 
reliability of a bit score depends on the lengths of the proteins 
matched and the size of the database. Raw score thresholds 
also are affected by matrix and gap parameters. 

Statistical Scores. Statistical scores were introduced partly 
to overcome the problems that arise from raw scores. This 
scoring scheme provides the best discrimination between 
homologous proteins and those which are unrelated. Most
likely, its power can be attributed to its incorporation of more
information than any other measure; it takes account of the 
full substitution and gap data (like raw scores) but also has 
details about the sequence lengths and composition and is 
scaled appropriately. 

We find that statistical scores are not only powerful, but also 
easy to interpret: ssearch and fasta show close agreement
between statistical scores and actual number of errors per 
query (Fig. 4). The expectation value score gives a good, 
slightly conservative estimate of the chances of the two se- 
quences being found at random in a given query. Thus an 
E-value of 0.01 indicates that roughly one pair of nonhomologs
of this similarity should be found in every 100 different queries. 
Neither raw scores nor percentage identity can be interpreted 
in this way, and these results validate the suitability of the 
extreme value distribution for describing the scores from a 
database search. 
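
The interpretation of E-values given above, and their relation to P-values, can be checked numerically. The sketch below uses the standard Poisson relation P = 1 - exp(-E) between an expectation value and the probability of at least one chance match. The function names are illustrative and are not taken from any of the programs tested.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance match, given the expected number E."""
    # -expm1(-E) equals 1 - exp(-E) but keeps precision when E is small.
    return -math.expm1(-e_value)


def expected_false_pairs(e_value_cutoff: float, n_queries: int) -> float:
    """Expected number of nonhomologous pairs at or below the cutoff over many queries."""
    return e_value_cutoff * n_queries
```

For small values the two measures nearly coincide (E = 0.01 gives P of about 0.00995), while for large E the P-value saturates at 1. An E-value cutoff of 0.01 applied over 100 queries corresponds to roughly one expected false pair.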

The P-values from blast also should be directly interpret- 
able but were found to overstate significance by more than two 
orders of magnitude for 1% EPQ for this database. Nonethe- 
less these results strongly suggest that the analytic theory is 
fundamentally appropriate. wu-blast2 scores were more re- 
liable than those from blast, but also exaggerate expected 
confidence by more than an order of magnitude at 1% EPQ.

Overall Detection of Homologs and Comparison of Algo-
rithms. The results in Fig. 5A and Table 1 show that pairwise
sequence comparison is capable of identifying only a small
fraction of the homologous pairs of sequences in PDB40D-B.
Even ssearch with E-values, the best protocol tested, could 
find only 18% of all relationships at a 1% EPQ. blast, which 
identifies 15%, was the worst performer, whereas fasta
ktup = 1 is nearly as effective as ssearch. fasta ktup = 2 and
WU-BLAST2 are intermediate in their ability to detect ho-
mologs. Comparison of different algorithms indicates that
those capable of identifying more homologs are generally
slower. ssearch is 25 times slower than blast and 6.5 times
slower than fasta ktup = 1. wu-blast2 is slightly faster than
fasta ktup = 2, but the latter has more interpretable scores.
In PDB90D-B, where there are many close relationships, the
best method can identify only 38% of structurally known
homologs (Fig. 5B). The method which finds that many
relationships is wu-blast2. Consequently, we infer that the
differences between fasta ktup = 1, ssearch, and wu-blast2
programs are unlikely to be significant when compared with 
variation in database composition and scoring reliability. 

Fig. 6 helps to explain why most distant homologs cannot be 
found by sequence comparison: a great many such relation- 
ships have no more sequence identity than would be expected 
by chance. ssearch with E-values can recognize >90% of the
homologous pairs with 30-40% identity. In this region, there
are 30 pairs of homologous proteins that do not have signif-
icant E-values, but 26 of these involve sequences with <50
residues. Of sequences having 25-30% identity, 75% are
identified by ssearch E-values. However, although the num-
ber of homologs grows at lower levels of identity, the detection
falls off sharply: only 40% of homologs with 20-25% identity
are detected and only 10% of those with 15-20% can be found.
These results show that statistical scores can find related
proteins whose identity is remarkably low; however, the power
of the method is restricted by the great divergence of many
protein sequences.

Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars
show the distribution of homologous pairs in PDB40D-B according to their
identity (using the measure of identity in both). Filled regions indicate
the number of these pairs found by the best database searching method
(ssearch with E-values) at 1% EPQ. The PDB40D-B database contains
proteins with <40% identity, and as shown on this graph, most
structurally identified homologs in the database have diverged ex-
tremely far in sequence and have <20% identity. Note that the
alignments may be inaccurate, especially at low levels of identity. Filled
regions show that ssearch can identify most relationships that have
25% or more identity, but its detection wanes sharply below 25%.
Consequently, the great sequence divergence of most structurally
identified evolutionary relationships effectively defeats the ability of
pairwise sequence comparison to detect them.

After completion of this work, a new version of pairwise 
blast was released: BLASTGP (37). It supports gapped align- 
ments, like WU-BLAST2, and dispenses with sum statistics. Our 
initial tests on blastgp using default parameters show that its 
E-values are reliable and that its overall detection of homologs 
was substantially better than that of ungapped blast, but not 
quite equal to that of wu-blast2. 

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, 27 
and references therein) suggests that the most effective se-
quence searches are made by (i) using a large current database
in which the protein sequences have been complexity masked
and (ii) using statistical scores to interpret the results. Our
experiments fully support this view. 

Our results also suggest two further points. First, the E-val- 
ues reported by fasta and ssearch give fairly accurate 
estimates of the significance of each match, but the P-values 
provided by blast and wu-blast2 underestimate the true
extent of errors. Second, SSEARCH, WU-BLAST2, and fasta
ktup = 1 perform best, though blast and fasta ktup = 2
detect most of the relationships found by the best procedures
and are appropriate for rapid initial searches.

Table 1. Summary of sequence comparison methods with PDB40D-B

Method                                 Relative time*   1% EPQ cutoff       Coverage at 1% EPQ (%)
ssearch % identity: within alignment   25.5             >70%                <0.1
ssearch % identity: within both        25.5             34%                 3.0
ssearch % identity: HSSP-scaled        25.5             35% (HSSP + 9.8)    4.0
ssearch Smith-Waterman raw scores      25.5             142                 10.5
ssearch E-values                       25.5             0.03                18.4
fasta ktup = 1 E-values                3.9              0.03                17.9
fasta ktup = 2 E-values                1.4              0.03                16.7
WU-BLAST2 P-values                     1.1              0.003               17.5
blast P-values                         1.0              0.00016             14.8

*Times are from large database searches with genome proteins.

The homologous proteins that are found by sequence com- 
parison can be distinguished with high reliability from the huge 
number of unrelated pairs. However, even the best database 
searching procedures tested fail to find the large majority of 
distant evolutionary relationships at an acceptable error rate. 
Thus, if the procedures assessed here fail to find a reliable 
match, it does not imply that the sequence is unique; rather, it 
indicates that any relatives it might have are distant ones.** 

Additional and updated information about this work, including 
supplementary figures, may be found at http://sss.stanford.edu/sss/.

A two-dimensional gel database of rat liver proteins
useful in gene regulation and drug effects studies

A standard two-dimensional (2-D) protein map of Fischer 344 rat liver
(F344MST3) is presented, with a tabular listing of more than 1200 protein species.
Sodium dodecyl sulfate (SDS) molecular mass and isoelectric point have been es-
tablished, based on positions of numerous internal standards. This map has been
used to connect and compare hundreds of 2-D gels of rat liver samples from a va-
riety of studies, and forms the nucleus of an expanding database describing rat
liver proteins and their regulation by various drugs and toxic agents. An example
of such a study, involving regulation of cholesterol synthesis by cholesterol-lower-
ing drugs and a high-cholesterol diet, is presented. Since the map has been ob-
tained with a widely used and highly reproducible 2-D gel system (the Iso-Dalt®
system), it can be directly related to an expanding body of work in other laborato-

1 Introduction

High-resolution two-dimensional electrophoresis of pro-
teins, introduced in 1975 by O'Farrell and others [1-4], has
been used over the ensuing 16 years to examine a wide va-
riety of biological systems, the results appearing in more
than 5000 published papers. With the advent of computer-
ized systems for analyzing two-dimensional (2-D) gel ima-
ges and constructing spot databases, it is also possible to
plan and assemble integrated bodies of information de-
scribing the appearance and regulation of thousands of pro-
tein gene products [5, 6]. Creating such databases involves
amassing and organizing quantitative data from thousands 
of 2-D gels, and requires a substantial commitment in tech- 
nology and resources. 

Given the long-term effort required to develop a protein da-
tabase, the choice of a biological system takes on consider-
able importance. While in vitro systems are ideal for answer-
ing many experimental questions, especially in cancer re-
search and genetics, our experience with cell cultures and
tissue samples suggests that some in vivo approaches could
have major advantages. In particular, we have noticed that
liver tissue samples from rats and mice appear to show grea-
ter quantitative reproducibility (in terms of individual pro-
tein expression) than replicate cell cultures. This is perhaps
a natural result of the homeostasis maintained in a com-
plete animal vs. the well-known variability of cell cultures,
the latter due principally to differences in reagents (e.g.,
fetal bovine serum), conditions (e.g., pH) and genetic "evo-
lution" of cell lines while in culture. It is also more difficult
to generate adequate amounts of protein from cell culture
systems (particularly with attached cells), forcing the inves-
tigator to resort to radioisotope-based or silver-based stain-
detection methods. While these methods are more sensi-
tive (sometimes much more sensitive) than the Coomassie
Brilliant Blue (CBB) stain typically used for protein detec-
tion in "large" protein samples, they are generally more vari-
able, more labor-intensive and, in the case of radiographic
methods, may generate highly "noisy" images, due to the
properties of the films used. By contrast, large protein sam-
ples can easily be prepared from liver using urea/Nonidet
P-40 (NP-40) solubilization and stained with CBB, which
has the advantage of being easily reproducible [8]. Finally,
there remains the question of the "truthfulness" of many in
vitro systems as compared to their in vivo analogs; how
great are the changes caused by the introduction into a cul-
ture and the associated shift to strong selection for growth,
and how do these affect experimental outcomes? Hence,
the apparent advantages of in vitro systems, in terms of ex-
perimental manipulation, may be counterbalanced by
other factors relating to 2-D data quality.

There is a second important class of reasons for exploring
the use of an in vivo biological system such as the liver. His-
torically, there have been two broad approaches to the me-
chanistic dissection of biochemical processes in intact cel-
lular systems: genetics (a search for informative mutants)
and the use of chemical agents (drugs and chemical toxins).
Both approaches help us to understand complex systems
by disrupting some specific functional element and show-
ing us the result. With the development of techniques for
genetic manipulation and cloning, the genetic approach
can be effectively applied either in vitro or in vivo, although
the in vitro route is usually quicker. The chemical approach
can also be applied to either sort of biological system; here,
however, the bulk of consistently acquired information is
in experimental animals (rats and mice). While most biolo-
gists know a short list of compounds having specific, experi-
mentally useful effects (e.g., inhibitors of protein synthesis,
ionophores, polymerase inhibitors, channel blockers, nu-
cleotide analogs, and compounds affecting polymerization
of cytoskeletal proteins), there is a much larger number of
interesting chemically-induced effects, most of them char-
acterized by toxicologists and pharmacologists in rodent
systems. Just as a thorough genetic analysis would involve
saturating a genome with mutations, it is possible to ima-
gine a saturating number of drugs, the analysis of whose ac-
tions would reveal the complete biochemistry of the cell.
While organized drug discovery efforts usually target spe-
cific desired effects, the nature of the process, with its de-
pendence on screening large numbers of compounds, ne-
cessarily produces many unanticipated effects. It is there-
fore reasonable to suppose that the required broad range of
compounds necessary to achieve "biochemical saturation"
may be forthcoming; in fact, it may already exist among the
hundreds of thousands of compounds that failed to qualify
as drugs.



Among organs, the liver is an obvious choice for the study
of chemical effects because of its well-known plasticity and
responsiveness. The brain appears to be quite plastic (e.g.,
[...]) but is a complicated mixture of cell types requiring
skillful dissection for most experiments. The kidney, while
quite responsive, also presents a potentially confounding
mixture of cell types. The liver, by contrast, is made up of
one predominant cell type which is easy to solubilize: the
hepatocyte, representing more than 95% of its mass. Most
importantly, the liver performs many homeostatic func-
tions that require rapid modulation of gene expression. It
appears that most chemical agents tested affect gene ex-
pression in the liver at some dosage (N. Leigh Anderson,
unpublished observations), an interesting contrast to our
earlier work with lymphocytes, for example, which seem to
be much less responsive. Such results conform to the expec-
tation that cells with a homeostatic, physiological role
should be more plastic than cells differentiated for a pur-
pose dependent on the action of a limited number of spe-
cific genes.

The liver also allows the parallels between in vitro and in
vivo systems to be examined in detail. Significant progress
has been made in the development of mouse, rat and hu-
man hepatocyte culture systems, as well as in [...]
tissue slices. Using such an approach, it is possi-
ble to assemble a matrix of [...]
mouse and rat in vivo on one level and mouse, rat and hu-
man in vitro on a second level, and to compare effects be-
tween species and between systems. This approach [...]
us to draw informed conclusions regarding the [...]
universality of biological responses among the [...]
and to offer some insight into the validity of [...] ap-
proaches for toxicological screening. We believe [...]
will be necessary if in vitro alternatives are to achieve [...]
usage in government-mandated safety testing of drugs, con-
sumer products and industrial and agricultural [...]

A number of interesting studies have been published using
2-D mapping to examine effects in the rodent liver. A num-
ber of investigators have made use of the [...]
screen for existing genetic variants [8-11] or induced muta-
tions [12-14], mainly in the mouse. This work builds on the
wealth of genetic information available on the mouse and
its established position as a mammalian mutation [...]
system. While some studies of chemical effects [...]
[...] in the mouse [15-17], most have used the
rat [18-23]. The examination of the cytochrome [...]

These considerations lead us to conclude that rodent liver
offers the best opportunity to systematically examine [...]
array of gene regulation systems, and ultimately to build a
[...] mammalian gene [...].
The basic underlying foundation of such a project is a reli-
able, reproducible master 2-D pattern of liver, to which on-
going experimental results can be referred. In this paper we
report such a master pattern for the acidic and neutral pro-
teins of rat liver (pattern F344MST3). In future, this master
will be supplemented by maps of basic proteins, and analog-
ous maps of mouse and human liver.



2 Materials and methods 
2.1 Sample preparation 

Liver is an ideal sample material for most biochemical stud-
ies, including 2-D analysis. A sample is taken of approxima-
tely 0.5 g of tissue from the apical end of the left lobe of the
liver. Solubilization is effected as rapidly as practical; a
delay of 5-15 min appears to cause no major alteration in
liver protein composition if the liver pieces are kept cold
(e.g., on ice) in the interim. In the solubilization process,
the liver sample is weighed, placed in a glass homogenizer
(e.g., 15 mL Wheaton); 8 volumes of solubilizing solution*
* The solubilizing solution is composed of 2% NP-40 (Sigma), 9 M urea
(analytical grade, e.g., BDH or Bio-Rad), 0.5% dithiothreitol (DTT;
Sigma) and 2% carrier ampholytes (pH 9-11 LKB; these come as a
stock solution, so 2% final concentration is achieved by making the
solution 10% 9-11 Ampholine by volume). A large batch of solubilizer
(several hundred mL) is made and stored frozen at -80 °C in aliquots
sufficient to provide enough for one day's estimated sample prepara-
tion requirement. The solution is never allowed to become warmer
than room temperature at any stage during preparation or
use, since heating of concentrated urea solutions can produce cya-
nates that covalently modify proteins, producing artifactual charge
shifts. Once thawed, any unused solubilizer is discarded.



are added (i.e., 4 mL per 0.5 g tissue) and the mixture is ho-
mogenized using first the loose- and then the tight-fit-
ting glass pestle. This takes approximately 5 strokes with
each pestle and is carried out at room temperature because
urea would crystallize out in the cold. Once the liver sample
is thoroughly homogenized in the solubilizer, it is assumed
that all the proteins are denatured (by the chaotropic effect
of the urea and NP-40 detergent) and the enzymes inacti-
vated by the high pH (~9.5). Therefore these samples may
be kept at room temperature until they can be centrifuged
or frozen as a group (within several hours of preparation).
The samples are centrifuged for 6 × 10^6 g·min (e.g., 500 000
× g for 12 min using a Beckman TL-100 centrifuge). The
centrifuge rotor is maintained at just below room tempera-
ture (e.g., 15-20 °C), but not too cold, so as to prevent the
precipitation of urea. The centrifuge of choice is a Beckman
TL-100 because of the sample tube sizes available, but any
ultracentrifuge accepting smallish tubes will suffice. When
an appropriate centrifuge is not available near the site of
sample preparation, samples can be frozen at -80 °C and
thawed prior to centrifugation and collection of superna-
tants. Each supernatant is carefully removed following cen-
trifugation and aliquoted into at least 4 clean tubes for stor-
age. This is done by transferring all the supernatant to one
clean tube, mixing this gently (to assure homogeneous
composition) and then dividing it into 4 aliquots. The ali-
quots are frozen immediately at -80 °C. These multiple ali-
quots can provide insurance against a failed run or a freezer
breakdown.
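
The arithmetic in this procedure (8 volumes of solubilizer per unit of tissue, and a centrifugation dose expressed as the product of relative centrifugal force and time) can be written as a small Python sketch. The helper names are hypothetical, and the volume calculation assumes that 1 g of tissue is treated as 1 mL.

```python
def solubilizer_volume_ml(tissue_g: float, volumes: float = 8.0) -> float:
    """Solubilizer volume in mL, treating 1 g of tissue as 1 mL (assumption)."""
    return tissue_g * volumes


def centrifugation_dose(rcf_g: float, minutes: float) -> float:
    """Centrifugation dose in g·min: relative centrifugal force times run time."""
    return rcf_g * minutes


def minutes_for_dose(target_g_min: float, rcf_g: float) -> float:
    """Run time in minutes needed to reach a target dose at a given force."""
    return target_g_min / rcf_g
```

A 0.5 g sample therefore takes 4 mL of solubilizer, and 500 000 × g for 12 min gives 6 × 10^6 g·min. By the same arithmetic, a rotor limited to 100 000 × g would need 60 min to deliver that dose; whether such a substitution is acceptable in practice is not addressed in the text.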

2.2 Two-dimensional electrophoresis

Sample proteins are resolved by 2-D electrophoresis using
the 20 × 25 cm Iso-Dalt® 2-D gel system ([26-29]; pro-
duced by LSB and by Hoefer Scientific Instruments, San
Francisco) operating with 20 gels per batch. All first-dimen-
sional isoelectric focusing (IEF) gels are prepared using the
same single standardized batch of carrier ampholytes
(BDH 4-8A in the present case, selected by LSB's batch-
testing program for rat and mouse database work**). A 10
µL sample of solubilized liver protein is applied to each gel
and the gels are run for 33 000 to 34 500 volt-hours using a
progressively increasing voltage protocol implemented by
a programmable high-voltage power supply. An Ange-
lique computer-controlled gradient-casting system (pro-
duced by LSB) is used to prepare second-dimensional sod-
ium dodecyl sulfate (SDS) polyacrylamide gradient slab
gels in which the top 5% of the gel is 11%T acrylamide, and
the lower 95% of the gel varies linearly from 11% to 18%T.
This system has recently been modified so as to employ a
commercially available 30.8%T acrylamide/N,N'-methyle-
nebisacrylamide prepared solution (thus avoiding the han-
dling of the solid acrylamide monomer) and three addi-
tional stock solutions: buffer (made from Sigma pre-set
[...]), persulfate and N,N,N',N'-tetramethylethylenedi-
amine (TEMED). Each gel is identified by a computer-
printed filter paper label polymerized into the lower left cor-
ner of the gel. First-dimensional IEF tube gels are loaded

* mtierul (succeeding certified bitches of which are aviiUble from 
Hoerer Scientific Instruments) has the most linear pH gradient pro- 
duced by any ampholyte tested except for the Pharmacia wide range 
phieh has an unacceptable tendency to bind high-molecular weight 
ttdic proteins, causing them to streak). 



directly (as extruded) onto the slab gels without equilibra- 
tion,, and held in place by polyester fabric wedges (Wed- 
gies f produced by LSB) to avoid the use of hot agarose 
Second-dimensionaJ slab gels are run overnight, in groups 
of 20, in cooled DAU tanks (10'C) with buffer circulation. 
All run. parameters, reagent source and lot information, 
and notations of deviation from expected results are ente- 
red by the technician responsible on a detailed, multi-pace 
record of the experiment. 
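
The slab gel composition described above (a uniform 11%T zone over
the top 5% of the gel, then a linear gradient to 18%T) can be written
as a piecewise-linear profile. The sketch below is only an
illustration; the function name is hypothetical, and it assumes that
depth is measured from the top of the gel and that the gradient
increases toward the bottom.

```python
def acrylamide_percent_t(depth_fraction, top_fraction=0.05,
                         top_t=11.0, bottom_t=18.0):
    """Return %T at a relative depth (0.0 = top of gel, 1.0 = bottom).

    The top `top_fraction` of the gel is uniform at `top_t`; below it,
    %T rises linearly and reaches `bottom_t` at the bottom edge.
    """
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must lie between 0 and 1")
    if depth_fraction <= top_fraction:
        return top_t
    position = (depth_fraction - top_fraction) / (1.0 - top_fraction)
    return top_t + position * (bottom_t - top_t)


if __name__ == "__main__":
    for depth in (0.0, 0.05, 0.525, 1.0):
        print(depth, round(acrylamide_percent_t(depth), 2))
```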

2.3 Staining

Following SDS-electrophoresis, slab gels are stained for protein
using a colloidal Coomassie Blue G-250 procedure in covered plastic
boxes, with 10 gels (totalling approximately 1 L of gel) per box.
This procedure (based on the work of Neuhoff [30, 31]) involves
fixation in 1.5 L of 50% ethanol and 2% phosphoric acid for 2 h,
three 30 min washes, each in 2 L of cold tap water, and transfer to
1.5 L of 34% methanol, 17% ammonium sulfate and 2% phosphoric acid
for 1 h, followed by the addition of a gram of powdered Coomassie
Blue G-250 stain. Staining requires approximately 4 days to reach
equilibrium intensity, whereupon gels are transferred to cool tap
water and their surfaces rinsed to remove any particulate stain
prior to scanning. Gels may be kept for several months in water
with added sodium azide. The water washes remove ethanol that would
dissolve the stain (and render the system noncolloidal, with high
backgrounds). The concentrated ammonium sulfate and methanol
solution is diluted by equilibration with the water volume of the
gels to automatically achieve the correct final concentrations for
colloidal staining. Practical advantages of this staining approach
can be summarized as follows: (i) the low, flat background makes
computer evaluation of small spots (max OD < 0.02) possible,
especially when using laser densitometry; (ii) up to 1500 spots can
be reliably detected on many gels (e.g., rat liver) at loadings low
enough to preserve excellent resolution; and (iii) reproducibility
appears to be very good: at least several hundred spots have
coefficients of reproducibility less than 15%. This value is at
least as good as previous CBB methods, and significantly better
than many silver stain systems.
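
The dilution step described above can be checked with simple
arithmetic. The following sketch assumes, as a simplification not
stated in the text, that the roughly 1 L of washed gel behaves as
pure water and mixes ideally with the 1.5 L of concentrated solution;
the function name is hypothetical.

```python
def equilibrated_concentration(stock_percent, stock_volume_l, gel_volume_l):
    """Concentration after a stock solution equilibrates with the gel water."""
    return stock_percent * stock_volume_l / (stock_volume_l + gel_volume_l)


if __name__ == "__main__":
    stock = {"methanol": 34.0, "ammonium sulfate": 17.0, "phosphoric acid": 2.0}
    for component, percent in stock.items():
        final = equilibrated_concentration(percent, 1.5, 1.0)
        print(f"{component}: {final:.1f}%")
```

Under these assumptions the 34% methanol and 17% ammonium sulfate
fall to about 20% and 10%, respectively, once the solution has
equilibrated with the gels.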

2.4 Positional standardization 

The carbamylated rabbit muscle creatine phosphokinase (CPK)
standards [32] are purchased from Pharmacia and BDH. Amino acid
compositions, and numbers of residues present in proteins used for
internal standardization, are taken from the Protein Identification
Resource (PIR) sequence database [33].



2.5 Computer analysis

Stained slab gels are digitized in red light at 134 micron
resolution, using either a Molecular Dynamics laser scanner (with
pixel sampling) or an Eikonix 78/99 CCD scanner. Raw digitized gel
images are archived on high-density DAT tape (or equivalent storage
media) and a greyscale videoprint prepared from the raw digital
image as hard-copy backup of the gel image. Gels are processed
using the Kepler® software system (produced by LSB), a commercially
available workstation-based software package built on some of the
principles of the earlier TYCHO system [34-41]. Procedure PROC008
is used to yield a spotlist giving position, shape and density
information for each detected spot. This procedure makes use of
digital filtering, mathematical morphology techniques and digital
masking to remove the background, and uses full 2-D least-squares
optimization to refine the parameters of a 2-D Gaussian shape for
each spot. Processing parameters and file locations are stored in a
relational database, while various log files detailing operation of
the automatic analysis software are archived with the reduced data.
The computed resolution and [...] of each gel are inspected and
archived for quality control purposes.
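
The spot model mentioned above, a 2-D Gaussian refined by least
squares, can be illustrated as follows. This is not the PROC008
implementation, whose details are not given here; it is a minimal
sketch of an axis-aligned Gaussian spot, its analytic integrated
volume, and the sum-of-squares objective that an optimizer would
minimize. All names and parameter values are hypothetical.

```python
import math


def gaussian_spot(x, y, amplitude, x0, y0, sigma_x, sigma_y):
    """Optical density of an axis-aligned 2-D Gaussian spot at pixel (x, y)."""
    dx = (x - x0) / sigma_x
    dy = (y - y0) / sigma_y
    return amplitude * math.exp(-0.5 * (dx * dx + dy * dy))


def integrated_volume(amplitude, sigma_x, sigma_y):
    """Analytic integral of the Gaussian over the whole plane."""
    return 2.0 * math.pi * amplitude * sigma_x * sigma_y


def sum_squared_residuals(image, params):
    """Least-squares objective; image is a list of rows of pixel values."""
    total = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            total += (value - gaussian_spot(x, y, *params)) ** 2
    return total


def synthetic_image(size, params):
    """Noise-free test image generated from the model itself."""
    return [[gaussian_spot(x, y, *params) for x in range(size)]
            for y in range(size)]


if __name__ == "__main__":
    # amplitude, x0, y0, sigma_x, sigma_y (hypothetical values)
    true_params = (1.0, 20.0, 20.0, 2.0, 3.0)
    shifted_params = (1.0, 21.0, 20.0, 2.0, 3.0)
    image = synthetic_image(41, true_params)
    print(sum_squared_residuals(image, true_params))           # 0.0
    print(sum_squared_residuals(image, shifted_params) > 0.0)  # True
    pixel_sum = sum(sum(row) for row in image)
    print(round(pixel_sum, 3), round(integrated_volume(1.0, 2.0, 3.0), 3))
```

A real fit would minimize this objective over all five parameters per
spot, and would also have to handle overlapping spots and background,
which the sketch leaves out.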

Experiment packages are constructed using the Kepler experiment
definition database to assemble groups of 2-D patterns
corresponding to experimental groups (e.g., treated and control
animals). Each 2-D pattern is matched to the appropriate "master"
2-D pattern (Pattern F344MST3 in the case of Fischer 344 rat
liver), thereby [...] rodent protein 2-D [...]. [...] allows
experiments containing hundreds of gels to be constructed and
analyzed as a unit, with [...] displayed on screen at one time for
comparative purposes and multiple pages to accommodate [...]. Spots
showing significant quantitative differences vs. appropriate
controls are selected using group-wise statistical parameters
(e.g., Student's t-test, Kepler® procedure STUDENT).
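
A group-wise Student's t-test of the kind referred to above can be
sketched in a few lines. This is the textbook pooled-variance form,
not the Kepler STUDENT procedure itself; the abundance values are
invented for illustration, and the critical value 5.041 is the
standard two-sided t-table entry for P = 0.001 at 8 degrees of
freedom.

```python
import math
from statistics import mean, variance


def student_t(group_a, group_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two values")
    degrees_of_freedom = n_a + n_b - 2
    pooled_variance = ((n_a - 1) * variance(group_a) +
                       (n_b - 1) * variance(group_b)) / degrees_of_freedom
    standard_error = math.sqrt(pooled_variance * (1.0 / n_a + 1.0 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, degrees_of_freedom


if __name__ == "__main__":
    # Hypothetical abundances of one spot in five control and five treated rats.
    control = [10, 12, 11, 9, 13]
    treated = [50, 55, 60, 45, 65]
    t_value, df = student_t(treated, control)
    critical_t = 5.041  # two-sided P = 0.001, 8 degrees of freedom
    print(round(t_value, 2), df, abs(t_value) > critical_t)
```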

Spots satisfying various quantitative criteria (such as P < 0.001
difference from appropriate controls) are represented as
highlighted spots onscreen or on computer-plotted protein maps and
stored as spot populations (i.e., logical vectors) in a liver
protein database. Quantitative data (spot parameters, statistical
or other computed values) are stored as real-valued vectors in the
database. Analysis of coregulation is performed using a Pearson
product-moment correlation (Kepler procedure CORREL) to determine
whether groups of proteins are coordinately regulated by any of the
treatments. Such groups can be presented graphically on a protein
map, and reported together with the statistical criteria used to
assess the level of coregulation. Multivariate statistical analysis
(e.g., principal components analysis) is performed on data exported
to SAS (SAS Institute).
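
The coregulation test described above rests on the Pearson
product-moment correlation between the abundance vectors of two spots
across the same set of gels. A minimal sketch follows; it is not the
Kepler CORREL procedure, and the abundance values are invented for
illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return s_xy / math.sqrt(s_xx * s_yy)


if __name__ == "__main__":
    # Hypothetical abundances of three spots over the same five gels.
    spot_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    spot_b = [2.0, 4.0, 6.0, 8.0, 10.0]   # rises with spot_a
    spot_c = [5.0, 4.0, 3.0, 2.0, 1.0]    # falls as spot_a rises
    print(round(pearson_r(spot_a, spot_b), 6))
    print(round(pearson_r(spot_a, spot_c), 6))
```

Spots regulated inversely give coefficients near -1, which is why the
sign as well as the magnitude of r matters when grouping spots.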



"'"•vi./A,^. 

2.6 Graphical data output

Graphical results are prepared in GKS and translated [...] into
output for any of a variety of devices. [...] are typically
prepared in PostScript and printed on an Apple LaserWriter.
Detailed maps presented here were generated using an
ultra-high-resolution PostScript-compatible Linotronic output
device. Greyscale [...] the [...] screen [...] a Seikosha
videoprinter. Patterns are shown in the standard [...]

2.7 Experiment LSBC04

In the study described here, 12-week-old Charles River [...] rats
were used. Diets were prepared at LSB based on a Purina 5755M Basal
Purified Diet. Lovastatin and cholestyramine were obtained as
prescription pharmaceuticals, ground and mixed with the [...] at
concentrations of 0.075% and 1%, respectively. The high [...] was
Purina 5801M-A [...] in the control diet [...]ological Associates
(Bethesda [...]) [...] acclimatized for one week on [...] control
diets for one week and [...] groups were 37 mg/kg/day and [...]
g/kg/day, respectively, based on the weight of the food consumed.
[...] were collected and prepared for [...] according to the
standard liver protocol ([...] volumes of 9 M urea, 2% NP-40, [...]
dithiothreitol, 2% LKB pH 9-11 carrier ampholytes [...]
centrifugation for 30 min at [...] X g [...]) and samples were
frozen. Gels were run as described above and the data was analyzed
using the Kepler system. Gels were scaled, to remove the effect of
differences in protein loading, by setting the summed abundance of
a large number of matched spots equal for each gel.

3 Results and discussion

3.1 The rat liver protein 2-D map



[...] standard [...] rat liver protein [...] from a single 2-D gel
and [...] comparing it to a range of protein [...] include both
small spots and [...] high-abundance [...] have been matched to
F344MST3 [...] More than 1200 spots are included, most of which are
visible on typical gels loaded with 10 µL of solubilized [...]
prepared by the standard protocol and [...] Coomassie Blue. Master
spot numbers [...] assigned to all proteins, and [...] low mass)
quadrant, Fig. 5 the lower left (acidic, [...] mass) quadrant. The
quadrants overlap [...] moving between them. The gel position (in
[...]), isoelectric point (relative to the CPK internal pI
standards) and SDS molecular mass (from the calibration curve in
Fig. 8) are listed for each spot (Table 1). Because of the
precision of the CPK pI values, these parameters can be used to
relate spot locations between gel patterns more reliably than using
pI measurements expressed as pH. A major objective of current
studies is the identification of all major spots corresponding to
known liver proteins, as well as rigorous definitions of
subcellular organelle contents. Of particular interest to us is the
parallel development of identifications in the rat and mouse liver
maps, allowing detailed comparisons of gene expression effects in
the two systems. The results of these studies will be presented
systematically in a later edition of this database. We include here
a useful series of 22 orienting identifications as an aid to other
users of the rat liver pattern (Table 2).

3.2 Carbamylated charge standards, computed pI's and
molecular mass standardization

We have previously shown that the use of a system of closely-spaced
internal pI markers (made by carbamylating a basic protein) offers
an accurate and workable solution to the problem of assigning
positions in the pI dimension [32]. The same system, based on 36
protein species made by carbamylating rabbit muscle CPK, has been
used here to assign pI's to most rat liver acidic and neutral
proteins. The standards were coelectrophoresed with total liver
proteins, and the standard spots added to a special version of the
master pattern F344MST3. The gel x-coordinates of all liver protein
spots lying within the CPK charge train were then transformed into
CPK pI positions by interpolation between the positions of
immediately adjacent standards (Table 1) using a Kepler® vector
procedure.
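
The transformation described above, from a gel x-coordinate to a
CPK pI position, amounts to linear interpolation between the two
flanking charge standards. The sketch below is an illustration only:
it is not the Kepler vector procedure, the function name is
hypothetical, and the standard positions are invented (real positions
would come from the master pattern).

```python
from bisect import bisect_right


def cpk_pi_position(x, standards):
    """Interpolate a CPK pI position for gel x-coordinate `x`.

    `standards` is a list of (gel_x, cpk_position) pairs sorted by
    ascending gel_x; `x` must lie within the charge train.
    """
    xs = [gel_x for gel_x, _ in standards]
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the CPK charge train")
    upper = bisect_right(xs, x)
    if upper == len(xs):            # x coincides with the last standard
        return standards[-1][1]
    x_lo, p_lo = standards[upper - 1]
    x_hi, p_hi = standards[upper]
    return p_lo + (x - x_lo) * (p_hi - p_lo) / (x_hi - x_lo)


if __name__ == "__main__":
    # Hypothetical x positions of three adjacent standards (-12, -11, -10).
    train = [(400.0, -12.0), (460.0, -11.0), (530.0, -10.0)]
    print(cpk_pi_position(430.0, train))   # -11.5
    print(cpk_pi_position(495.0, train))   # -10.5
```

Because only immediately adjacent standards are used, local
distortions of the gel affect a spot and its flanking standards
alike, which is presumably why the resulting positions transfer well
between gels.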

It has proven possible to compute fairly accurate pI values for
many proteins from the amino acid composition [42]. We have
attempted here to test a further elaboration of this approach, in
which we computed pI's for the CPK standards themselves, based on
our knowledge of the rabbit muscle CPK sequence and the fact that
adjacent members of the charge train typically differ by blockage
of one additional lysine residue (Table 3). We compared these
values to similar computed pI's for an additional set of
carbamylated standards made from human hemoglobin beta chains and a
series of rat liver and human plasma proteins of known position and
sequence (Fig. 7, Table 4). The result demonstrates good
concordance between these systems. Two proteins show significant
deviations: liver fatty-acid binding protein (FABP; #1 in Table 4)
and protein disulphide isomerase (#[...] in the table). The FABP
spot present on F344MST3 may represent a charge-modified version of
a more basic parent spot closer to the expected pI, not resolved in
the IEF/SDS gel. Of particular importance is the fact that, by
comparing computed pI's of sequenced but unlocated proteins with
the CPK pI's, we can assign a probable gel location without making
any assumptions regarding the actual gel pH gradient. This offers a
useful shortcut, given the vagaries of pH measurement on small
diameter IEF gels. We have used this approach to compute the CPK
pI's of all rat and mouse proteins in the PIR sequence database, as
an aid to protein identification (data not shown).
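
Computing a pI from amino acid composition, and the effect of
blocking successive lysines, can be sketched with the
Henderson-Hasselbalch equation and a bisection search for the pH of
zero net charge. The pK values below are generic textbook-style
assumptions rather than those of reference [42], the composition is
invented, and the model ignores modification of the N-terminus.

```python
# Assumed side-chain and terminal pK values (illustrative only).
PK_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0, "N_term": 9.0}
PK_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "C_term": 3.1}


def net_charge(composition, ph):
    """Net charge of a protein with the given group counts at a given pH."""
    charge = 0.0
    for group, pk in PK_POSITIVE.items():
        charge += composition.get(group, 0) / (1.0 + 10.0 ** (ph - pk))
    for group, pk in PK_NEGATIVE.items():
        charge -= composition.get(group, 0) / (1.0 + 10.0 ** (pk - ph))
    return charge


def isoelectric_point(composition, tolerance=1e-4):
    """pH of zero net charge, by bisection (charge falls as pH rises)."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if net_charge(composition, middle) > 0.0:
            low = middle
        else:
            high = middle
    return (low + high) / 2.0


def carbamylate(composition, blocked_lysines):
    """Copy of the composition with some lysine charges removed."""
    modified = dict(composition)
    modified["K"] = composition.get("K", 0) - blocked_lysines
    if modified["K"] < 0:
        raise ValueError("cannot block more lysines than are present")
    return modified


if __name__ == "__main__":
    # Hypothetical composition (counts of charged groups only).
    protein = {"K": 30, "R": 15, "H": 10, "D": 25, "E": 25,
               "C": 4, "Y": 9, "N_term": 1, "C_term": 1}
    for blocked in range(4):
        pi_value = isoelectric_point(carbamylate(protein, blocked))
        print(blocked, round(pi_value, 2))
```

Each blocked lysine removes one positive charge, so the computed pI
moves to a lower pH with every step, mirroring the charge train seen
on the gel.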

In order to standardize SDS molecular weight (SDS-MW), we have used
a standard curve fitted to a series of identified proteins
(Fig. 8). Rather than using molecular mass per se, we have elected
to use the number of amino acids in the polypeptide chain, as
perhaps a better indication of the length of the SDS-coated rod
that is sieved by the second dimension slab. The resulting values
were multiplied by [...] (the weighted average mass of amino acids
in sequenced proteins) to give predicted molecular masses. Because
we use gradient slabs, we have not constrained the fitted curve to
conform to any predetermined model; rather we tried many equations
and selected the best using the program Tablecurve® on a PC. The
equation chosen was y = a + [...] + c/x^3, where y is the number of
residues, x is the gel y-coordinate, a is 511.83, b is -0.[?]731
and c is 331[?]3801. The resulting fit appears to be fairly good
over a broad range of molecular mass.



3.3 An example of rat liver gene regulation: Cholesterol
metabolism

Experiment LSBC04 was designed as a small-scale test of the
regulation of cholesterol metabolism in vivo by three agents
included in the diet: lovastatin (Mevacor®, an inhibitor of HMG-CoA
reductase); cholestyramine (a bile acid sequestrant that has the
effect of removing cholesterol from the gut-liver recirculation);
and cholesterol itself. The first two agents should lower available
cholesterol and the third should raise it, allowing manipulation of
relevant gene expression control systems in both directions. Such
an experiment offers an interesting test of the 2-D mapping system
since most of the pathway enzymes are present in low abundance,
many are membrane-bound and difficult to solubilize, and the
pathway itself is complex. Approximately 1000 proteins were
separated and detected in liver homogenates. Twenty-one proteins
were found to be affected by at least one treatment, and these
could be divided into several coregulated groups.

3.3.1 MSN 413 (putative cytosolic HMG-CoA synthase)
and sets of spots regulated coordinately or inversely

One group of spots (including a spot assigned to the cytosolic
HMG-CoA synthase, MSN 413) showed the expected increase in
abundance with lovastatin or cholestyramine, the synergistic
further increase with lovastatin and cholestyramine, and a dramatic
decrease with the high cholesterol diet. Spot number 413 is the
most strongly regulated protein in the present experiment, showing
a 5- to 10-fold induction after a 1 week treatment with 0.075%
lovastatin and 1% cholestyramine in the diet (Figs. 9 and 10). Its
expression follows precisely the expectation for an enzyme whose
abundance is controlled by the cholesterol level; it is
progressively increased from the control levels by cholestyramine,
lovastatin and lovastatin plus cholestyramine, and it sinks below
the threshold of detection in animals fed the high cholesterol
diet. This spot has been tentatively identified as the cytosolic
HMG-CoA synthase, based on a reaction with an antiserum to that
protein provided by Dr. Michael Greenspan at Merck Sharp & Dohme
Research Laboratories. This enzyme lies immediately before HMG-CoA
reductase in the liver cholesterol biosynthesis pathway, and is
known to be co-regulated with it. Spot 413 has an SDS molecular
weight of about 54 000 and a CPK pI of -11.4, in reasonably close
agreement with a molecular weight of 57 300 and a CPK pI of -15.7
computed from the known sequence of the hamster enzyme [43].

Using a classical product-moment correlation test (Kepler procedure
CORREL), a series of five additional spots was found to be
coregulated with 413. The level of correlation was exceedingly high
(> 95%). Two of these, 1250 and 933, are at similar molecular
weights and approximately one charge more acidic than 413 (Fig. 9),
indicating that they may be covalently modified forms of the 413
polypeptide. This suspicion is strengthened by the observation that
both spots are also stained by the antibody to cytosolic HMG-CoA
synthase. The remaining three correlated spots appear to comprise
an additional related pair (1253 and [...]) [...] a single spot
([...]) [...] Because these two presumed proteins are [...]
cytosolic HMG-CoA synthase is reported to consist of only [...]
enzymes. A second [...] of six spots [...] spots probably represent
additional enzymes or subunits [...]

3.3.2 MSN 235 and coregulated spots 

A third group of five spots, mainly comprised of mitochondrial
[...] including putative mitochondrial HMG-CoA synthase (MSN 235),
showed a modest induction by lovastatin alone, but little or no
effect with any of the other treatments. [...] because lovastatin
[...] packed triad at approximately 30 kDa, and are likely to [...]
by an antibody to the mitochondrial form of HMG-CoA synthase [...]
Greenspan [...] fractionation indicate a mitochondrial location.
The other two spots (633 at about 38 kDa and 724 at about 69 kDa)
are each present at lower abundance than the members of the [...]



"Hi 

proteins of the putative [...] mitochondrial pathway are much more
variable [...] examination of all the [...] quantitative
statistical [...] interesting information from large sets [...] of
expression among the five individual [...] of the [...]
cholestyramine treatment group. The effect [...] not due to
differences in total protein loading, since they have already been
removed by scaling, and [...] different regulation patterns [...]
(Fig. 13). Such effects raise the possibility that [...] sets may
be revealed [...] sufficiently large population of control animals
([...] without any experimental manipulation). This approach,
exploiting natural biological variation in protein [...] instead of
drug effects, offers an important [...] of a large library [...]
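
The scaling mentioned above, which removes differences in total
protein loading by setting the summed abundance of matched spots
equal for each gel, can be sketched as follows. The choice of the
mean total as the common target, the function name and the spot
values are assumptions made for illustration; the text states only
that the sums are made equal.

```python
def scale_gels(gels, reference_spots):
    """Scale each gel so its summed reference-spot abundance is equal.

    `gels` maps gel id -> {spot number: abundance}. Every gel must
    contain all `reference_spots`. Returns a new, scaled mapping.
    """
    totals = {gel_id: sum(spots[s] for s in reference_spots)
              for gel_id, spots in gels.items()}
    target = sum(totals.values()) / len(totals)   # assumed common target
    return {gel_id: {spot: abundance * target / totals[gel_id]
                     for spot, abundance in spots.items()}
            for gel_id, spots in gels.items()}


if __name__ == "__main__":
    # Hypothetical data: the second gel carries twice the protein load.
    gels = {
        "rat1": {101: 100.0, 102: 50.0, 413: 10.0},
        "rat2": {101: 200.0, 102: 100.0, 413: 80.0},
    }
    scaled = scale_gels(gels, reference_spots=[101, 102])
    for gel_id, spots in scaled.items():
        print(gel_id, spots)
```

In this invented example spot 413 differs 8-fold between the raw
gels but 4-fold after scaling, the remainder having been a loading
artifact.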



4 Conclusions 

Because of the widespread use of rat liver in both biochemistry and
in toxicology [...] and the [...] effects grows, we [...]
mechanistic toxicology is to be [...]
3.3.3 An example of an anti-synergistic effect

3.3.4 Complexity of the cholesterol synthesis pathway

[...] may be in lovastatin's effect on levels of [...] and related
precursor compounds [...]
Figure 9. Montage showing effects in the region of MSN:413. The
montage shows a small window into one portion of the 2-D pattern,
one row of windows for each experimental group, and one panel for
each gel in the experiment. The left-most pattern in each row is a
group-specific copy of the master pattern, followed by the patterns
for the five individual rats in the group. The highlighted protein
spots (filled circles) are spot 413 (on the right of each panel;
identified as cytosolic HMG-CoA synthase) and two modified forms of
it (1250 and 933). From the top, the rows (experimental groups)
are: high cholesterol, controls, cholestyramine, lovastatin, and
lovastatin plus cholestyramine.
Figure 10. Bargraph showing the quantitative effects of various
treatments on the abundance of MSN:413 (cytosolic HMG-CoA synthase)
in the gels of Fig. 9.




Figure 11. Bargraphs of a series of six coregulated spots including
MSN:413. In the bargraphs, the abundances of the appropriate spot
(master spot number shown at the top of the panel) in each animal
are shown. The five five-animal groups are in the order (left to
right): high cholesterol, controls, cholestyramine, lovastatin, and
lovastatin plus cholestyramine. Each bar within a group represents
one experimental animal liver (one 2-D gel). Note the correlated
expression of the 6 spots, especially in the two far right (most
strongly induced) groups.
Figure 12. Data on a second coregulated group of spots, presented
as in Fig. 11. The fourth experimental group (lovastatin) shows a
modest induction, while the fifth group (lovastatin plus
cholestyramine) does not.




Figure 13. Data on spot MSN:367, presented as in Fig. 11. This
panel shows unambiguously the anti-synergistic effect of lovastatin
and cholestyramine (fifth group) as compared to lovastatin (fourth
group). This response contrasts strongly with the regulation
pattern seen in Fig. 11.
15
311
566 
612 
549 
645 
629 
906 
755 
434
263 
426 
520
414



1204 
332 
787 
313 
607 



27 1164 
26 1263 

29 743 

30 766 

32 1216 

33 1145 

34 103? 

35 663 

36 712 
36 763 
39 304 

41 1165 

42 664 

43 1316 

44 1924 

46 1203 

47 1391 

48 309 

49 605 
.50 621 
51 1113 

-52 1820 

53 725 

54 2001 

55 722 

56 678 

57 1682 
56 1091 

59 1171 

60 1400 

61 1853 

62 1888 

65 735 

66 1263 

67 1252 
6B 779 
69 1064 

71 656 

72 638 
1582 



74 1570 

75 1264 



76 
77 
7B 
79 
80 



1338 
1833 
1767 
925 
534 



81 1611 
« 1412 



63 
64 



1471 
1662 



85 1566 
* 1817 
8? 516 

86 1569 
» 1706 
» 651 
»1 1415 
*2 1773 



93 
94 



1336 
1706 



434 
424 
417 
516 
524 
446 
605 
112 
417 
445 
555 
412 
606 
694 
470 
569 
607 
589 
362 
566 
447 
454 
587 
535 
522 
499 
177 
500 
630 
533 
302 
560 
585 
624 
506 
567 
297 
312 
407 
602 
296 
569 
545 
563 
556 
621 
564 
363 
565 
736 



363 
661 
347 
563 
479 
301 
1371 
696 
719 
329 
710 
545 
446 
696 



••35.0 
-24 J 

•16.0 
-25.2 
•15.3 
-71.6 
•14.0 
-17.5 
•20.9 
•6.7 
<-35.0 
-16.6 
«-35.0 
•16.1 
•9.0 
•8.0 
-17.6 
-17.2 
4.6 
•9.5 
-11J 
•14.9 
-18.7 
-17 J 
<-35.0 
•0.2 
•19.6 
-7.3 
-01 
•6.7 
•6.3 
<-35.0 
-22.5 
-21.8 
-10.0 
-0.9 
-18.3 
>0.0 
-164 
-19.6 
-2.5 
-10.3 
-9.2 
-6.2 
-0.6 
-0.4 
•18.1 
•6.0 
-8.1 
•16.8 
•10.8 
-20.6 
-21.2 
•3.6 
•3.8 
•6.0 
•7.0 
-0.6 
-1.5 
-13.6 
-26.1 
-1.0 
•6.0 
-5.0 
-2.7 
•3.4 
•0.9 
•27.0 
-3.5 
Z2 
-20.8 
•6.0 
-1.4 
•7.0 
Z2 



63.600 
102,900 
64.600 
101.000 
56.200 
50.000 
66.300 
6020 
67.000 
62.100 
63.600 
65.000 
66.000 
55.500 
54.900 
62.400 
49.000 
346.600 
66.000 
62.500 
52.400 
66.600 
46.000 
43.800 
59.800 
51,400 
48.600 
50.000 
74.600 
50.200 
62.300 
61.500 
50.100 
53.900 
55.000 
57.000 
170.600 
56.600 
37.300 
54,100 
69.000 
50.60C 
50.300 
47.600 
5620 
51.500 
90.500 
65.900 
6720 
43,900 
90.80C 
50.000 
53.100 
50.400 
52.300 
48.000 
51.000 
74,400 
51,700 
41.600 
43.600 
74.500 
44.500 
77.500 
5120 
56.900 
69.100 
17.400 
43.600 
4220 
61.700 
43.000 
5320 
62.300 
43.700 



65 1118 
* 1731 
87 1033 

66 1406 
69 578 

100 2004 

101 1106 



102 
103 
104 
105 



462 
665 

773 
312 



106 1769 

107 1565 

108 1692 

109 1482 



110 
111 



778 
1728 



113 1191 

114 1296 

115 662 

116 1146 

117 1548 

118 1050 

120 1530 

121 638 

122 1572 

123 23 

124 621 

125 129B 
12S 872 

127 1000 

128 1229 

129 1422 

130 1 776 



536 
756 
566 
565 

1149 
536 
623 
455 
630 

1182 

1117 
506 
720 

607 

503 

516 

700 

660 

165 

907 

610 

649 

577 



131 
132 
133 



1930 
660 
666 



134 1271 

135 1161 

136 453 

137 1858 

138 1504 

139 1488 

140 1669 

141 311 

142 1366 

143 1429 

144 615 

145 2006 

146 2006 

147 1070 

148 1347 

149 &41 

150 1645 

151 1269 

152 1507 

153 1722 

154 832 

155 1031 

156 1970 

157 1256 
156 1275 

159 1663 

160 1034 

161 1953 

162 1020 
164 1566 

166 1905 

167 1340 
166 1506 

169 1338 

170 1966 

171 800 

172 476 

173 Q19 



423 
712 
1433 
1474 
862 
921 
717 
311 
632 
499 
757 
537 
1019 
662 
1389 
1063 
823 
697 
707 
756 
1417 
915 
346 
1017 
566 
516 
1108 
578 
1481 
760 
236 
911 
448 
503 
294 
664 
183 
417 
620 
527 
771 
1482 
606 
565 
161 
563 
678 
541 
378 
958 
1314 



•9.9 

-2.0 
■11 4 
•6.1 
-238 
>0.0 
•10.1 
-28.5 
-20.2 
-17.0 
<-35.0 
-1.5 
-3.6 
-2.4 
-4.8 
-16.9 
-2.0 
•6.9 
-7.5 
-19.6 
-0.5 
-4.1 
-11.1 
-4.3 
-154 
-3.8 
«-35.0 
-21.9 
-7.5 
•14.7 
-12.0 
•84 
•5.8 
-1.4 
-0.1 
-20.4 
•20.2 
-7.9 
-9.3 
-29.7 
-0.6 
-4.6 
-4.8 
-24 
<-35.0 
-6.7 
•5.7 
-22.1 
>0.0 
>0.0 
-10.7 
-6.9 
-25.7 
•28 
-7.9 
-4.5 
•2.1 
-13.5 
-11.4 
>0.0 
•8.1 
-7.8 
•2.6 
-11.4 
>0.0 
-11.6 
•3.8 
-0.2 
-7.0 
-4.6 
•7.0 
>0.0 
-16.3 
-28.7 
13.7 



53.600 
40.700 
51.600 
51.700 
25.000 
53.700 
47,900 
6120 
37.300 
23.800 
26,100 
56.100 
42.500 
36.30C 
49.700 
55.500 
43.500 
44.500 
1 60.600 
34.10C 
48.70C 
36.500 
50.80C 
37.40C 
652C 
42.90C 

i5 t 3a 

13.90C 
36.00C 
33.5a 
42«X 
86, IOC 
37.3a 
57.0a 
40.7a 
53.8a 
29.7a 
36.00C 

i6.8a 
28.1a 

37.7a 
43.7a 

43.2a 

40.7a 
15.80C 
33.8a 
77.9a 
29.8a 

51.6a 

55.3a 
26.5a 
50 800 

13,7a 

40.500 

117.0a 
33.9a 
62,ia 

56.6a 

91,4a 

44.400 

162.400 
65.900 
37.8a 
54.6a 
40.000 

13.7a 

38.400 

5i,7a 

164,9a 
50.400 
44,7a 
53.5a 
71.800 

32.i a 

1620 



MSN 



CWW SOSUW 



174 1364 

175 825 

177 1562 

178 1371 

179 1069 
1666 



411 



160 
181 
162 
164 1660 
185 1997 
166 279 



163 
383 

553 
710 
615 
567 
295 
730 



187 


773 


166 


1536 


191 


1560 


192 


1816 


193 


1469 


194 


1380 


195 


784 


196 


1227 


197 


667 


196 


2006 


199 


1711 


2a 


872 


21 


222 


202 


736 


203 


786 


204 


1224 


205 


436 


206 


1994 


207 


1895 


208 


240 



210 17a 

211 902 

213 1087 

214 1340 

215 1561 

216 1565 

217 1159 



218 
219 



931 
713 



220 1479 

221 965 
223 934 

225 1812 

226 821 

227 1566 
226 1065 

229 1577 

230 1458 
232 1440 

234 1692 

235 618 



236 
237 



920 
952 



236 1611 
239 1409 



240 
241 



501 
1620 



242 1357 

243 711 

244 1855 

245 1189 

246 551 

247 1348 

248 460 

249 1733 

250 1974 

251 806 



252 
253 
254 



674 
753 
995 



255 1660 

256 OCU 

257 506 
256 1517 



1017 
1113 
296 
807 
674 
667 
555 
266 
632 
1185 
553 
661 
674 
424 
435 
253 
629 
589 



571 
687 
1418 
496 
517 
664 
666 
495 
755 
393 
572 
177 
911 
927 
716 
1045 
411 
1463 
567 
690 
496 
849 
489 
1004 
1136 
1006 
541 
720 
448 
569 
656 
1162 
621 
474 
459 
604 
448 
451 
786 
392 
553 
648 
450 
679 
1006 
464 
820 



•6.7 
-15.7 
•3.6 
-7.2 
-10 4 
•0.5 
-32.1 
-16.2 
•0.6 
>0.0 
<-35.0 
-17.0 
-4.2 
•3.9 
-0.9 
-5.0 
•64 
-16.7 
•84 
-20.1 
>0 0 
-2.2 
-14.7 
<-35.0 
•18.0 
•16.7 
•6.5 
-30.9 
>0.0 
-0.3 
<-35.0 
-2.3 
-14.1 
•10.4 
-7.0 
-3.5 
•3.6 
-9.3 
-13.5 
-18.7 
-4.9 
-126 
-13.5 
-1.0 
-15.8 
-3.6 
-10.8 
•3.7 
-5.2 
•5.5 
-2.4 
-22.0 
-13.7 
-13.1 
-3.2 
-4.8 
-27.7 
-0.9 
-6.8 
-18.7 
-0.6 
•6.8 
-25.1 
-6.9 
-29.3 
-1.8 
>0.0 
•16.1 
•14.6 
•17.6 
-12.1 
-24 
-12.1 
274 
•44 



162.900 
69.300 
52.600 
43.000 
48.3a 
51.600 
9120 
4220 

34.5a 

29.6a 
26.3a 
90.8a 
38 400 

44.9a 

4420 

52.400 
101.6a 
47.3a 
23.7a 
52.6a 
4420 
44.9a 
55.000 
63.7a 
" 107.8a 
37.42 
50.000 

3i.ia 
51.3a 
44.2a 

1520 
57.0a 
55.42 
44.42 

45.2a 

57.3a 

40,7a 

69.3a 

51 .2a 

170.5a 
33.9a 
33.3a 

42.7a 

28.8a 

66 ea 

n.6a 
51.6a 

3420 

57.3a 
36.5a 
57.9a 

2sa 

25.42 

30.2a 

S3.5O0 
42.500 

62.ia 
51,4a 

45.8a 
23.8a 
4820 
59.3a 

6i. oa 

49.100 

62,ia 
6i.8a 

39.2a 
69.52 
52.5a 
36.5a 

61 .oa 

44.6a 

22a 

60.42 

37.1 
MSN
511
51* 1«9 

513 1606 

514 Ma 

515 <*1 

516 1334 

517 666 
519 796 
519 622 
52D 632 
S» 1332 

522 603 

523 1160 

524 479 

525 766 
536 74? 

527 1170 

528 1502 
530 1726 

532 507 

533 670 

534 1347 

535 1513 

536 306 

538 1851 

539 1463 

540 909 

541 625 

542 1164 
SC 803 

544 1259 

545 656 

546 803 

547 1162 
546 126 
549 1355 
90 595 

52 1369 

53 992 
555 1125 

555 705 

557 1477 

556 960 

558 700 
560 1028 
562 896 
56< 789 

565 777 

566 980 

567 U19 

569 1212 

570 760 



484 

533 
1034 
636 
543 
1044 
1021 
779 
670 
165 
830 
1104 
309 
1226 
1066 
1016 
231 
542 
620 
1011 



571 616 
573 1142 



574 
575 



532 
771 



576 1068 

577 822 

578 9i4 

579 1064 

560 1524 

561 1392 

562 982 
5*4 iw 
*5 756 

567 wo 

s2 1888 

* 1317 



1065 
346 
654 
689 
982 
561 
289 
196 
655 
1143 
1526 
1071 
274 
1321 
1122 
866 
494 
405 
410 
975 
1030 
583 
1109 
621 
794 
1446 
766 
328 
611 
661 
594 
956 
771 
787 
250 
534 
734 
754 
794 
714 
783 



591 
562 
550 
594 



65 
1014 
732 
1627 



fee 



672 
731 
1152 
523 
774 
485 
519 
1546 
614 
176 
478 
1426 



-16.0 
•10.2 
-3L3 
-13.2 
-26.5 
-7.1 
-14.8 
•164 
•15.7 
•21.5 
-7.1 
-22.6 
4.9 
-26.6 
•17.2 
-17.7 
-9.2 
-46 
-2.0 
-274 
-14.7 
-6.9 
-4.5 
<-35.0 
•0.7 
-5.1 
-13.9 
-21.7 
-9.2 
-16.2 
•6.0 
-15.0 
•16 2 
-9.3 
c-35.0 
•6.6 
-23.0 
•6.6 
-12.2 
•9.6 
-18.9 
-4.9 
-12.5 
-19.1 
-11.5 
-14.1 
-16.6 
-16.6 
-12.5 
-44 
-8.6 
-17.4 
•21.9 
-96 
•26.2 
-17.1 
-10.6 
•15.7 
-13.8 
•10.6 
-44 
-6.3 
-12.4 
-4.8 
-17.4 
-19.5 
-13.5 
-0.4 
-21.1 
•7.3 
<-35.0 
-11.7 
-18.1 
•3.0 
-11.6 



56.400 
54.100 
29.200 
47.100 
53.400 
26.800 
29.700 
39.600 
45.100 
169.000 
37JO0 
26.600 
66.800 
22.300 
28,000 
29.800 
119.600 
53 400 
48,000 
X.000 
57.900 
27.300 
77.800 
46.000 
44.100 
31.100 
52.000 
63.100 
146.200 
45.900 
25.200 
12.200 
27.800 
98.400 
19.000 
25.900 
35.800 
57.500 
67.600 
66.900 
31.400 
29.300 
50400 
26.400 
48.000 
38.600 
14.900 
40.200 
61.600 
48.600 
45.600 
49.700 
32.100 
40.000 
39.300 
109.200 
54.100 
41.800 
40.800 
38.900 
42.800 
39.400 
44.200 
45.000 
41.900 
24.900 
55.000 
39.900 
56,300 
55.300 
11300 
46.400 
17Z300 
59.000 
15.500 



MSN 



506 619 

597 1176 



509 

600 
601 
602 
603 
604 
606 



1465 
741 
907 
687 

712 



783 
736 

606 629 

607 1064 

608 883 

609 2012 

610 12S5 
612 11Q3 



613 
614 



778 
-824 



615 1095 

616 1759 



617 
616 



994 

751 



619 1429 

620 1050 



621 
622 
623 
624 



923 
1462 
759 
756 



625 1436 

626 1096 



942 



627 
628 

629 899 

630 1135 

631 979 

632 1542 

633 1345 

634 409 

635 1165 

636 774 

637 1263 
636 952 

639 1 717 

640 994 



641 
642 
643 
644 
645 
646 



165 
803 
719 
1100 
534 
1153 

648 1246 

649 14 

650 1713 

651 1966 

652 1378 

653 1442 

654 650 

655 mi 

656 1095 

657 1524 

658 1777 

659 391 



660 
661 
662 



877 
656 

732 



663 1787 

664 886 

665 889 

666 715 

667 781 
666 646 
666 1116 

670 1362 

671 S47 
673 964 



269 
461 
1044 
1186 
402 
656 
1138 
181 
1461 
223 
273 
286 
503 
610 
903 
391 
265 
516 
195 
478 
372 
374 
518 
520 
1105 
622 
225 
1038 
606 
1089 
548 
621 
979 
1321 
615 
1076 
614 
950 
704 
604 
524 
411 
575 
292 
1224 < 
251 
296 
294 
1263 
1038 
204 
1406 
1049 
1163 
616 
1165 
806 
551 
661 
540 
860 
584 
565 
166 
312 
567 
268 
775 
221 
227 
165 
353 
643 
786 
746 



•21.9 
-9.1 
•5.0 
-17.9 
•14.0 
•19.5 
-18.7 
-14.1 
-16.7 
-16.0 
-21.6 
-10.6 
-14.5 
>O.0 
-8.1 
-10.1 
•16.9 
-15.7 
-10.3 
-1.6 
•12.1 
-17.6 
-5.7 
•11.1 
-13.7 
-5.1 
-17 4 

-174 

-5.5 
■10.2 
-13.3 
-16.0 

-14.1 

-9.6 
-12.5 
•4 1 
-6.9 
-32.2 
-9.2 
-17.0 
* -8.0 
-13.1 
-2.1 
-12.1 
<-35.0 
-16.2 
-18.5 
-10.2 
-26.1 
-9.4 
-8.2 
<*35.0 
•2.1 
>00 
-65 
-5.5 
-20.8 
-10.0 
-10.3 
-4 4 
-1.4 
•33 4 
-12.5 
-20.5 
-18.1 
-1.2 
•14.4 
-14.3 
•16.6 
•16.6 
-21 .0 
-96 
•64 
-25.3 
-12 4 



100.500 
60.700 
28.800 
23.600 
66.000 
45.000 
25.400 
165.200 
14.400 
125400 
96.700 
94.000 
56.700 
46.700 
34.200 
69,600 
102.000 
55.400 
149,100 
56,000 
72.000 
72.400 
55.300 
55.200 
26.600 
47.900 
124.000 
29.000 
48.900 
27.200 
53.000 
48.000 
31,300 
19.100 
46.300 
27.600 
38.000 
32.400 
43.300 
49.000 
54.600 
66,700 
51.000 
92.000 
22.400 
106.900 
90.700 
91.400 
21.000 
29.000 
140.000 
16,200 
28.600 
23.800 
38.000 
24.400 
36.400 
52.700 
36.000 
53.600 
36.000 
50400 
51.700 
187.500 
86.100 
51,500 
100.900 
39.600 
126,300 
122.400 
169.100 
76.300 
46.600 
39.200 
41.200 
674
675 
676 
677 
678 
679 



681 
662 
663 
664 
685 
686 
687 
686 
689 
690 
691 
692 
693 
694 
695 
696 
697 
696 
699 
702 
703 
705 
706 
707 
709 
710 
712 
713 
714 
715 
716 
717 
718 
719 
721 
722 
723 
724 
725 
726 
727 
728 
729 
730 
731 
733 
734 
735 
736 
738 
739 
740 
741 
742 
743 
744 
745 
746 
748 
749 
750 
751 
752 
754 
755 
756 
757 
760 



1661 
1523 
708 
919 
1085 
600 
1237 
1103 
1406 
1596 
555 
1167 
1932 
1545 
1456 
1011 
1995 
812 
1154 
1993 
1628 
928 
1854 
1997 
957 
1540 
577 
1610 
1278 
1841 
1018 
1074 
293 
720 
1386 
1328 
698 
701 
1875 
575 
1216 
1069 
1272 
956 
763 
720 
1476 
1846 
510 
1217 
1856 
665 
1321 
719 
1101 
1359 
696 
667 
1205 
995 
896 
881 
1951 
726 
999 
182 
2005 
1448 
792 
469 
664 
1195 
1821 
909 
790 



562 
642 
615 
551 
923 
1Q04 
263 
477 
249 
699 
1313 
790 
619 
764 
953 
270 

1461 
619 
656 
254 
715 
345 
563 
730 
900 
562 
571 
704 
1386 
1145 
889 
412 
841 
263 
433 
481 



702 
204 
464 
506 
622 
395 
916 
415 
473 
783 
1126 
724 
765 
312 
427 
473 
569 
220 
409 
256 
563 
596 
181 
686 
166 
643 
1503 
649 
575 
266 
296 
254 
164 
1113 
246 
133 



-2.7 
-44 

-16.8 

-13.7 
•10.5 
-22.7 
-8.3 
10.1 
-6.1 
•34 
•24.8 
-9.2 
00 
-4.1 
•5.2 
-11.8 
>0.0 
-16.0 
-9 4 
>0.0 
•3.0 
-13.6 
•0.6 
>0.0 
•13.0 
-4.2 
-23.8 
-3.2 
-7.8 
-0.7 
-11.7 
-10.7 
<-35.0 
-18.5 
•64 
-7.1 
-19.1 
-19.0 
-0.5 
-23.9 
-6.6 
•10.8 
-7.9 
-13.0 
-17.3 
-18.5 
-4.9 
-0.7 
-27.3 
-6.6 
-0.6 
-20.2 
-7.2 
-16.5 
-10.2 
-6.7 
-19.2 
-19.5 
-8.7 
-12.1 
-14.1 
•14.5 
>0.0 
-18.3 
-12.0 
<35.0 
>00 
•5 4 
-16.5 
-28.9 
-20.3 
•8.6 
•0.9 
-13.9 
-16.5 



63L100 
51.900 
46.700 
46.300 

52.700 
33 400 

30.300 
95.100 
56.100 
109.800 
43.500 
19.300 
39.100 
48.100 
40.300 
32.300 
100.2QO 
34.900 
144O0 
37.800 
45.900 
107.000 
42.700 
78.000 
. 51.800 
42.000 
34.400 
51,900 
51.200 
43.300 
16.900 
25.100 
34 800 
66.600 
36.800 
103.100 
63.900 
56.700 
43.600 
43.400 
140.400 
60.400 
56.400 
37.700 
69.100 
33.700 
66.200 
59.400 
39.400 
25.800 
42.300 
40.300 
85.900 
64 600 
59.500 
51,400 
127.600 
67.000 
106.200 
51.900 
49.500 
165.900 
44,200 
163.600 
46.600 
13.000 
46.300 
51.000 
101.900 
90.600 
107.000 
161,000 
26.300 
111.000 
264 900 



926 



761 1300 

763 1416 

764 2Q20 

765 651 

766 1052 

767 I960 
760 1330 
760 1070 

770 857 

771 1337 
773 1576 

775 969 

776 1430 

777 1539 
770 050 
770 700 
7B0 1052 

784 1413 

785 1364 

786 1822 

787 833 

700 816 

701 451 

702 777 

703 1536 

704 1461 

706 388 

707 1126 

708 833 



y cpkdi sosmw 



733 



475 
1140 
468 
685 
613 
617 
874 
502 
824 
708 
458 
434 
411 
1136 
520 
685 
835 
382 



799 1420 
600 1759 



801 
802 



624 



1775 
573 
203 
980 
902 

808 625 

809 1851 



804 
805 
806 
807 



610 
811 
812 
813 



440 

1358 
851 
745 



614 2028 

615 1086 

616 629 

817 1376 

818 1771 
619 1045 

820 984 

821 1712 
622 1256 

823 1517 

824 1442 

825 1240 
626 1309 
827 2012 
628 837 

830 1342 

831 562 

832 1073 

833 481 
634 501 
637 751 
836 635 

839 1494 

840 1952 

841 1565 

842 571 
643 1325 
6*4 1727 
845 630 

646 2016 

647 673 



1429 
377 
1543 
807 
546 
212 
437 
593 
279 
665 
547 
I486 
196 
494 
1039 
306 
827 
1015 
573 
249 
393 
1246 
810 
645 
313 
1177 
700 
263 
362 
279 
205 
654 
449 
513 
1014 
708 
1405 
756 
826 
1039 
620 
581 
740 
833 
459 
301 
1080 
1312 
649 
301 
679 
90S 
1200 



-6-2 
•5.9 
>0.0 
-20.0 
-11.1 
>0.0 
-7.1 
>0.0 
-15.0 
-7.0 
-3.7 
-12.0 
-5.5 
-4.2 
-15,1 
-19.1 
•11.1 
-6.0 
-6.7 
-0.9 
-14.3 
-22.0 
-29.8 
-16.9 
-4.2 
-5.1 
-33.6 
-9.8 
■13.5 
-59 
-1.6 
-21.7 
-14.2 
•1 4 
•24.0 
<-35.0 
-12.5 
-14.1 
-21 .7 
-0.7 
-30.9 
-66 
-15.1 
-17.8 
>0.0 
-104 
-21.6 
-6.5 
-14 
-11.2 
-124 
-2.2 
-8.1 

-4 4 

-5.5 
4.3 
•7.4 
>O.0 
-134 
-7.0 
-24.5 
-10.7 
-28.5 
-27.0 
-17.6 
-21.3 
-4.7 
>0.0 
-3.6 
-24.1 
-7.2 
-2.0 
•21.5 
>0.0 
19.9 



41 AO 
27.300 
51.400 
50.300 
25.000 
59.900 
44.300 
48.500 
4&200 
31.500 
56.700 
37.600 
43.100 
61 MO 
63.800 
66.800 
25500 
54.400 
35.000 
37.100 
69.500 
35.100 
15.400 
72.000 
11.700 
36.300 
53.100 
133.700 
63.400 
49800 
96.500 
35.800 
53.000 
14.200 
148.400 
57.400 
20.000 
87.200 
37.500 
29.900 
51.100 
109.700 
69.400 
21.600 
38.200 
46.500 
85.700 
24.000 
39.100 
103.100 
74.600 
96.700 
139.200 
46.000 
62.000 
55.800 



43.100 
16.200 
40.700 
37.500 
29.000 
37.800 
50.500 
41.100 
37.200 
60.900 
89,300 
27.500 
19,400 
46.300 
89.200 
44.600 
34.200 
23,200 



MSN 



Y CPW SOSMW 



848 1863 

649 1166 

650 1535 
851 1035 
652 634 
855 499 

656 1063 

657 687 

658 1448 

659 706 
1070 

472 
674 
1307 
645 
627 
665 
1807 



860 
861 
862 
864 
865 



869 

870 1323 

871 1228 

872 1904 

873 556 

674 1540 

675 1566 

676 1166 
877 1076 



870 
679 



1161 
647 



880 1 756 

881 1543 
883 1432 
684 922 



885 

866 

887 

686 

869 

890 

891 

892 

894 



1103 

1501 
798 
636 
951 
717 

1123 
891 

1245 



895 1962 

896 1322 



897 



900 
901 
903 
904 
905 
907 
906 
910 
911 

913 1606 

914 1237 

016 1442 

017 1260 
019 764 
920 1133 

1123 
829 



420 
662 
845 
624 
931 
799 
765 
775 
888 
826 
681 
1544 



821 
923 
824 ii3i 
925 
926 



1441 

679 

927 1487 

928 1082 

929 1 231 
931 1609 



932 
933 
934 
936 



810 
965 
947 
865 



937 1421 



271 
523 
1024 
826 
S42 
220 
194 
890 
639 
311 
1066 
347 
480 
490 
887 
10O4 
494 
402 
783 
1031 
346 
647 
756 
777 
351 
720 
1111 
757 
594 
276 
690 
689 
414 
607 
1103 
634 
759 
S46 
229 
413 
234 
346 
626 
570 
426 
243 
703 
1094 
229 
520 
869 
824 
1303 
1544 
301 
367 
686 
749 
367 
1541 
1123 
380 
242 
316 
674 
219 
1191 
775 
816 
670 
900 
520 
462 
843 
1056 



-0.6 
-9.2 
•4.2 
•11.4 
-15.5 
-27.8 
-10.9 
-14.4 
-54 
-18.9 
-10.7 
-28.6 
•18.9 
-7.4 
•21 .0 
•15.6 
-19.5 
-1.0 
-7.2 
-6.4 
■0.3 
-24.8 
-4.2 
-3.8 
-8.6 
-10.6 
•9.3 
-20.9 
-1.6 

-4 1 

•5.7 
-13.7 
-10.1 
-46 
•16.3 
•21.3 
-13.1 
-16.6 
•9.6 
-14.3 
-6.2 
>0.0 
-7.2 
-31.4 
•20.3 
•15.3 
-21.7 
-13.5 
-16.3 
-17.2 
•17.0 
•14 4 
-15.6 
•19 7 

-4.1 

-3 3 
-6.3 
-5.5 
-6.0 
-17.3 
-97 
•9.6 
-15.6 
-9.7 
-5.5 
-19.7 
-4.6 
-10.5 
-64 
•3.3 
-16.0 
-12.6 
■13.2 
-14.6 
■59 



99.500 
54.900 
29.600 
37.500 
53.400 
127.100 
150.500 
34.800 
46.900 
86.200 
28.000 
77.600 
58.600 
57.000 
34.900 
30.300 
57.400 
66.000 
39.400 
29.300 
77.700 
46.400 
40.700 
39.700 
76.800 
42.500 
26.400 
40.700 
49.700 
97.100 
34.600 
44.100 
66.400 
46.900 
26.600 
47.200 
40.600 
52.900 
121.200 
66.400 
117.800 
77.700 
47.700 
51.300 
64.500 
113,000 
43.400 
27.000 
121.000 
55.200 
34.800 
37.600 
19.700 
11.700 
89.100 
70.400 
44.100 
41.100 
73.700 
11.700 
25.900 
71.500 
113.200 
84.300 
35.400 
126.200 
23.500 
39,800 
36.000 
45,100 
34.400 
55.100 
60.600 
36.800 
28.400 



MSN 


x 


839 


1197 


941 


1765 


942 


602 


943 


312 


944 


993 


945 


1300 


946 


630 


947 


187 


040 


1380 


949 


1766 


950 


1036 


951 


860 


952 


957 


954 


503 


955 


1938 



957 1010 



950 
960 
961 
962 
963 
964 
965 



766 
596 
557 
867 
564 
669 
671 



966 1204 

967 910 
966 609 
969 1285 



822 
976 
403 

279 



970 
971 
972 
974 
975 

976 1124 

977 994 

978 1612 

979 749 

980 1064 
961 1197 

983 1762 

984 1344 

985 1024 



967 
966 
990 
991 



739 
816 
785 
1159 



992 1090 

993 1030 



994 

995 
996 



847 
902 
888 



997 1815 
996 1205 



827 
885 
472 
486 

491 
2G9 
423 
736 
344 
665 
193 
152 
701 
547 
712 
816 
174 
419 
409 
320 
334 
1156 
255 
796 
154 
1046 
206 
232 
437 
567 
495 
961 
295 
664 
642 
1141 
642 
911 
1506 
317 
1105 
1159 
555 
361 
317 
928 
701 
811 
461 
847 
579 



999 
1000 
1001 



617 
966 
970 



1002 1736 

1003 643 



1006 
1007 
1009 



622 
675 
291 



1010 1386 

1011 459 

1012 679 

1013 1818 

1014 1032 

1015 1629 

1016 1311 

1017 1722 

1018 1015 

1020 1574 

1021 781 

1022 1129 

1023 812 

1024 785 

1025 1290 



504 
299 
290 
771 
478 

1164 
467 
279 
644 
745 
541 
661 

1128 

634 

994 
1134 

424 

743 
1219 

464 
83 

317 

446 

739 



-8.8 

■1.5 
•22.7 
O5.0 
*12 1 
-7.5 
*2l 6 
<-350 
-65 
-V5 
■11.3 
-14.9 
-13.0 
*27.6 
>0.0 
■11.8 
•17.2 
-23.0 
•24 8 
•14 4 
•24.5 
-12.6 
•20.0 
-6.7 
-13.9 
-22.3 
-7.7 
-15.6 
-12.6 
•32 6 
<-35.0 
•15.3 
-9.6 
-12.1 
-32 
•17.7 
-10.8 
-6.6 
-1.6 
-6.9 
-11.5 
-17.9 
•15.9 
-16 7 
-9.3 
-104 
•11.5 
•15.2 
-14.1 
-14 4 
-0.9 
■6.7 
-22 0 
-12 6 
•12.7 
-1.9 
-21 1 
•15.8 
•14 6 
<-35.0 
-64 
•29 4 
-19.7 
■09 
-11 4 
-3.0 
-7 4 
•2 0 
•11.7 
-3.7 
•166 
-97 
-15.9 
-16.7 
7.7 



59.6cc 
57.1* 

«5.1* 
41.Q* 

78JQC 
45 4fc 

151. ODC 
213.Q0C 
43.400 
53.CC0 

37.800 
174.9C6 
€5.7tt 
67.1CC 
83.90C 
80 5CC 
24.0CC 
106.60c 
38.700 
*10.30C 
26.700 
138.900 
119.30C 
63 4CC 
S1.6CC 
57.40C 
31JCC 
91.100 
45 400 
46.700 
25.300 
46.700 
33.900 
12800 
84.700 
26.600 
24 600 
5240C 
74.900 
64 500 
33.300 
43400 
38.200 
60.700 
36.600 
50.700 
56.500 
93100 
92.700 
40.000 
58 900 
23.7W 
58.100 
96.400 
46.600 
41.300 
53 SOD 
4 5 SOD 
25800 
47.30D 
30 TOO 
25 500 
65.000 

41.300 
22L5DD 

58.400 

591.3* 
04.8D8 
62.400 
41.S* 



40* 
1296 
856 
1284 
986 
1547 
1301 
1525 
1128 
1226 
1761 
541 
816 
1036 
1439 
1540 
1576 
1089 
949 
426 
1583 
770 
1613 
1380 
284 
1261 
393 
1617 
1245 
1258 
705 
1181 
529 
508 
1898 
873 
1788 



871 
1697 
1157 
620 
1867 
2019 
1546 
1545 
61 
1954 
588 
1050 
457 
1884 
1714 
1717 
1976 
547 
1348 
1385 
1078 
878 
1202 
1022 
1005 
1512 
1114 
1464 
1040 
1122 
1722 
L* 1008 
1830 
784 
1968 



D*Uh«« Of fit tottf 



927 



Y CPKpi SOSMW 



M9< 



Y CPK* SOSUW 



1(87 



1090 
1081 
102 
1033 
1034 
1035 
1036 
1030 
1010 
1011 



405 
1206 

856 
1284 



552 



1045 
1047 
1046 
1040 
1060 
1061 
1062 
1063 
1064 
1066 
1066 
1066 
1060 
1061 
10G2 
1064 
1065 
1066 
1067 
1066 
1060 
1071 
1073 
1075 
1076 
1078 
I0B1 
1063 
1065 
1090 
1092 
1063 
1004 
1005 
1096 

tooo 

1101 
1102 
'103 
105 

;i06 

107 

noe 
mi 

112 
115 
116 
=117 
118 

'lie 
:i20 
'121 
122 
123 
'125 
126 
126 
133 
139 
147 
146 



1S47 
1361 
1525 
1128 
1226 
1761 
541 
818 
1Q36 
1430 
1540 
1576 
1080 
049 
426 
1583 
770 
1613 
1380 
284 
1261 
303 
1817 
1245 
1256 
705 
1161 
520 
506 
1806 
873 
1766 
836 
1863 
826 
071 
1697 
1157 
620 
1867 
2019 
1546 
1545 
61 
1054 
588 
1050 
457 
1864 
1714 
1717 
1076 
547 
1346 
1385 
1078 
075 
1202 
1022 
1005 
1512 
1114 
1464 
104B 
1122 
1722 
1006 
1630 
764 
1866 



547 
226 
822 
403 
551 
406 
645 
274 
262 
630 
010 
485 
407 
250 
635 
411 
1040 
816 
1385 
1002 
620 
377 
663 
746 
805 
645 
746 
792 
034 
734 
656 



604 
600 
1128 
773 
861 
566 
483 
202 
7*4 
910 
507 
804 
538 
477 
035 
237 
1046 
667 
707 
532 
640 
546 
722 
1066 
621 
762 
616 
787 
033 
1076 
616 
1301 
677 
452 
857 
802 
882 
625 
560 
1182 
724 



-321 
-7.5 
-15.0 
-77 
•12.3 
-4.1 
-64 
-4.3 
•0.7 
45 
•1.6 
•2S.7 
-15.6 
-11.3 
-5.5 
-4.2 
-37 
•10 4 
-13.2 
-31.1 
•3.6 
-16.8 
-3.2 
•6.5 
<45.0 
-8.0 
•33.3 
-0.0 
-6.2 
•8.1 
-18.0 
-0.0 
-26.3 
-27.4 
-0.3 
-147 
-1.5 
-15.4 
-0.6 
-15.7 
•12.7 
•2.3 
•9 4 
-21.9 
-0.5 
>0.0 
-4.1 
-4.1 
<-35.0 
>0.0 
•23.3 
-11.1 
-20.5 
-04 
-2.1 
-2.1 
>0.0 
•25.3 
-6.0 
-64 
-10.6 
-12.6 
-87 
-11.6 
-0.3 
-4.5 
•0.9 - 
-5.1 
-11.1 
•0.6 
-2.1 
•10.2 
•0.8 
•17 J 
>0.0 



52.800 
36.500 
53.000 
123.200 

37.700 
67.000 
52.700 
57.200 
46.500 
06.300 
103.600 
36.000 
34.000 
56.300 
67.300 
100.200 
47.100 
66.700 
28.000 
37.800 
16.000 
27.000 
48.000 
72.000 
45.500 
41,200 
40.000 
46.600 
41.200 
30.000 
33.000 
41.600 
45.800 
43.700 
49.100 
46,700 
25.600 
30.000 
36.000 
51.600 
56.500 
142.300 
36.900 
34.000 
48,500 
34.600 
53.700 
59.100 
33.000 
116,000 
28.600 
45.200 
38.800 
54,200 
46.300 
53.100 
42.400 
28.000 
48.000 
40.400 
38.000 
30.300 
33.100 
27.600 
46.300 
10.700 
44.700 
61.700 
36.200 
36.600 
34.700 
37.500 
51.400 
23.800 
42.300 



MSN 



115) 
1154 
1161 
1162 
1163 
1168 
1170 
1171 
1172 
1174 
1176 
1177 
1178 
1178 
1180 
1181 
1182 
1183 
1184 
1165 
1186 
1180 
1100 
1101 
1102 
1103 
1104 
1105 
1106 
1107 
1106 
1109 
1200 
1201 
1202 
1203 
1204 
1205 
1206 
1200 
1210 
1211 
1212 
1214 
1215 
1216 
1217 
1216 
1210 
1220 
1221 
1222 
1223 
1224 
1225 
1226 
1227 
1228 
1220 
1230 
1231 
1232 
1233 
1234 
1235 
1236 
1237 
1236 
1230 
1240 
1241 
1242 
1243 
1244 
1245 



Y CP** SOSUW 



821 
15*4 
637 



1156 



665 
564 
552 
536 
545 
1000 
1304 
1366 
1606 
1485 
1459 
1431 
1407 
1383 
1454 
1422 
1304 
1171 
14S7 
686 
265 
403 
344 
505 
572 
639 
637 
614 
637 
1005 
1719 
791 
064 
313 
306 
320 
326 
304 
402 
386 
641 
660 
914 
673 
970 
1021 
1392 
1354 
1362 
673 
614 
603 
606 
707 
475 
466 
750 
1324 
1583 
1865 
1812 
1411 
1392 
704 
769 
740 
743 
713 
862 
663 
565 



400 
307 
307 
528 
529 
524 
514 
522 
586 
530 
702 
224 
224 
223 
223 
224 
182 
183 
182 
214 
286 
1114 
803 
1202 
1275 
1311 
1203 
1502 
1402 
1407 
1431 
1304 
1545 
666 
1021 
105 
104 
197 
197 
204 
204 
294 
329 
329 
266 
245 
372 
296 
205 
203 
205 
540 
542 
539 
623 
628 
447 
1282 
1461 
1170 
1005 
809 
817 
703 
662 
410 
407 
406 
511 
510 
509 
504 
562 



-137 
•3.5 
•21.3 
•21 J 
•20.2 

-244 

•25.0 
•25.9 
-25.5 
•10.2 
-7.5 
•6.6 
-3.3 
-4.8 
.-5.2 
-57 
-6.1 
-64 
-5.3 
•5.8 
-6.3 
•9.2 
•5.2 
-19.5 
c-35.0 
-32.6 
<-35.0 
•27.6 
-24.1 
-21.2 
-21.3 
-22 1 
-21.3 
-10.3 
-2.1 
•16.5 
-12.9 
<-35.0 
<*35.0 
<-35.0 
<-35.0 
-33 2 
•32 7 
•33 7 
-21 .2 
-20 4 
-13.8 
•14.7 
-12.7 
-11.6 
-63 
-6.8 
•67 
-19.9 
-221 
-22.6 
-19.2 
-16.9 
•287 
-290 
-17.4 
-7.2 
-3.6 
-0.6 
•1.0 
-6.0 
-63 
•16 4 
•17.1 
-17.9 
-17.8 
-167 
•18.6 
•20.3 
•24.4 



24.700 
35.000 
66.400 
68.800 
66.700 
54.500 
54.500 
54.600 
55.700 
55.000 
50.200 
53.700 
43.400 
124.000 
124.000 
125.100 
125.200 
124.700 
164.400 
162.600 
164,300 
131.800 
04.200 
26.200 
34,700 
20.000 
20.600 
19.400 
20.000 
13.000 
16.300 
16.200 
15.400 
16.600 
11.600 
45.200 
29.700 
146.700 
149.600 
147.400 
146.600 
91.400 
91.200 
91 400 
81.600 
81.600 
101.800 
112.000 
72.900 
90.100 
139.500 
141.800 
139.500 
53 600 
53 400 
53.600 
47.800 
47.500 
62.300 
20,400 
14 400 
24.200 
30.300 
36.200 
37.900 

43 400 

44 500 

66.900 
67.300 
67.500 
55.900 
56,000 
56.100 
56.500 
50.500 



1246 

1247 

1249 

1250 

1251 

1252 

1253 

1254 



547 
530 
516 
073 
607 
665 
890 
1311 



1255 1300 
1257 1936 



1256 
1250 



1806 

1727 



1260 1629 

1261 1555 



1262 
1263 
1264 



1468 
1413 
1340 



1265 1263 

1266 1182 

1267 mo 
1266 1055 



1269 

1270 

1271 

1272 

1273 

1274 

1277 

1278 

1279 

1280 

1281 

1262 

1283 

1284 

1285 

1266 

1287 

1288 

1289 

1290 

1291 

1292 

1293 

1294 

1295 



909 

050 

005 

857 

810 

774 

737 

702 

671 

645 

617 

595 

573 

552 

536 

515 

496 

467 

447 

427 
412 
397 
381 
365 
346 



577 
576 
572 
536 
532 
529 
766 
746 
761 
712 
718 
715 
713 
717 
717 
722 
717 
717 
720 
717 
717 
717 
715 
712 
714 
705 
711 
708 
711 
710 
710 
707 
704 
700 
695 
694 
667 
663 
669 
667 
6S5 
655 
652 
654 
653 
653 



25.3 
•26.3 
-27.0 
-127 
•224 
•20.2 
-14.1 
-7.4 
•75 
0.0 
•1.0 
-2.0 
-3.0 
-4.0 
-5.0 
-6.0 
•7.0 
-6.0 
-0.0 
-10.0 
•11.0 
-12.0 
-13.0 
•14.0 
-15.0 
-16.0 
-17.0 
•16.0 
-19.0 
•20.0 
-21.0 
-22.0 
-23.0 
-24.0 
-25.0 
-26,0 
•27.0 
-28.0 
-29.0 
•30.9 
•31.0 
■32.0 
■33.0 
34.0 
35.0 
35.0 



50.800 

50.000 

51.200 

53.000 

54.200 

54.400 

40.200 

41.200 

40.400 

42.000 

42.600 

42.700 

42.600 

42.600 

42.600 

42.400 

42.600 

42.600 

42.500 

42600 

42.600 

42.600 

42.700 

42 900 

42.800 

43.300 

42.900 

43.100 

42.900 

43.000 

43.000 

43.100 

43.300 

43.500 

43.700 

43.800 

44.200 

44.400 

45.200 

45.300 

45.900 

45.900 

46,100 

46.000 

46.100 

46100 
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el. Computed p/Wiwoi 
bemoflobin (Hb) 



Oaub«a« of 



Protein tundanis: febbit ou*le CPK 



m Irvcr prnit ma 



929 



P« MSP #GLU 

. ***** 23 4.1 

Kabbit muscle CPK KIRBCM 



Protein Name 



•H»S fLYS #ARG NH2- 
M 10.8 12J 7.0 



*ad human 

Caic ReaJ 
m CPK 




Ht>-t>eta, human HBHU 



7 
7 
7 
7 
7 
7 
7 
7 
7 

7 

7 

7 

7 
11
10
9
6


3
4


3
3


6 


9 


0 


3 


8 


9 


0 


3 



7.18 

6.79 

6.53 

6.32 

6.13 

5.96 

5.78 

5.59 

5.37 

.5.14 

4.91 

4.71 

4.54 



•1.8 
-3.2 
•5.3 
-7.2 
-10.0 
•12.3 
•15.5 
•18.0 
•21.0 
•25.5 
-27.2 
Table 4. Computed pI's of some known proteins, related to measured CPK pI's

Protein name | PIR name
Creatine phosphokinase (CPK), rabbit muscle | KIRBCM
Fatty acid-binding protein, rat hepatic | F2RTL
b2-microglobulin, human | MGHUB2
Carbamoyl-phosphate synthase, rat | SYRTCA
Proalbumin (serum albumin precursor), rat | ABRTS
Serum albumin, rat | ABRTS
Superoxide dismutase (Cu-Zn, SOD), rat | A26810
Phospholipase C, phosphoinositide-specific [...], rat | A28807
Albumin, human | ABHUS
Apo A-I lipoprotein, rat | A24700
proApo A-I lipoprotein, human | LPHUA1
NADPH cytochrome P-450 reductase, rat | RDRT04
Retinol binding protein, human | VAHU
Actin beta, rat | ATRTC
Actin gamma, rat | ATRTC
Apo A-I lipoprotein, human | LPHUA1
Apo A-IV lipoprotein, human | LPHUA4
Tubulin alpha, rat | UBRTA
F1 ATPase beta, bovine | PWBOB
Tubulin beta, pig | UBPGB
Protein disulphide isomerase (PDI), rat hepatic | ISRTSS
Cytochrome b5, rat | CBRT5
Apo C-II lipoprotein, human | LPHUC2

Amino acid pK's assumed in calculation:

2-D Database of rat liver proteins 1977

An updated two-dimensional gel database of rat liver 
proteins useful in gene regulation and drug effect 
studies 

We have improved upon the reference two-dimensional (2-D) electrophoretic
map of rat liver proteins originally published in 1991 (N. L. Anderson et al.,
Electrophoresis 1991, 12, 907-930). A total of 53 proteins (102 spots) are now
[...] Coomassie Blue stained 2-D gels were submitted to internal tryptic
digestion [...] peptides separated by high-performance liquid chromatography
(HPLC) [...] using a Perkin-Elmer 477A sequenator. Additional spots were
identified using specific antibodies. Additional [...]



Figure 1 shows the current annotated 2-D map of F344
rat liver, analyzed using the Iso-DALT system (20 × 25
cm gels) and BDH 4-8 carrier ampholytes. Both the
map itself and the master spot number system remain
the same as shown in the original publication. Table 1
lists the important features of each identification shown,
including the gel position, pI, and M_r for the most
abundant or most basic form of each protein. Using this
extended base of identified spots, a series of four
improved calibration functions has been derived for the
pI and SDS-M_r axes (the first two of which are shown in
Fig. 2A and B). Both forward and reverse functions are
derived, so that one can compute the physical properties
of a spot with a given gel location, or inversely compute
the gel position expected for a protein having given
physical properties:



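$$Y_{\text{RAT LIVER}} = f_{M_r \rightarrow \text{RAT LIVER } Y}\left(M_{r,\ \text{SEQUENCE-DERIVED}}\right) \tag{1}$$

$$X_{\text{RAT LIVER}} = f_{\mathrm{p}I \rightarrow \text{RAT LIVER } X}\left(\mathrm{p}I_{\text{SEQUENCE-DERIVED}}\right) \tag{2}$$

$$M_{r,\ \text{GEL-DERIVED}} = f_{\text{RAT LIVER } Y \rightarrow M_r}\left(Y_{\text{RAT LIVER}}\right) \tag{3}$$

$$\mathrm{p}I_{\text{GEL-DERIVED}} = f_{\text{RAT LIVER } X \rightarrow \mathrm{p}I}\left(X_{\text{RAT LIVER}}\right) \tag{4}$$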
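
As a rough illustration of a forward/reverse calibration pair such as Eq. (4) and its inverse Eq. (2), the sketch below interpolates linearly between (gel X, experimental pI) pairs taken from Table 1. Piecewise-linear interpolation is an assumption made for brevity; it stands in for the analytic functions of Table 2, whose coefficients are not reproduced here. The function names are hypothetical.

```python
from bisect import bisect_right

# (gel X, experimental pI) pairs copied from Table 1, sorted by gel X.
CALIBRATION = [
    (476.24, 4.66), (621.29, 4.93), (673.00, 5.03), (688.22, 5.06),
    (721.71, 5.11), (824.69, 5.29), (905.67, 5.41), (920.41, 5.43),
    (1391.03, 5.99), (1480.01, 6.08), (1507.19, 6.10),
]
GEL_X = [x for x, _ in CALIBRATION]
PI = [p for _, p in CALIBRATION]


def interpolate(value, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing.

    Values outside the calibrated range are clamped to the end points
    (an assumption; the published functions extrapolate analytically).
    """
    if value <= xs[0]:
        return ys[0]
    if value >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, value)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (value - x0) / (x1 - x0)


def gel_x_to_pi(x):
    """Stand-in for Eq. (4): gel X coordinate -> pI."""
    return interpolate(x, GEL_X, PI)


def pi_to_gel_x(pi):
    """Stand-in for Eq. (2): pI -> gel X coordinate."""
    return interpolate(pi, PI, GEL_X)
```

Because both columns increase monotonically, the same routine serves in either direction; the M_r/gel Y pair of Eqs. (1) and (3) could be handled the same way after sorting by gel Y.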

A spreadsheet program (in Microsoft Excel) was developed
to facilitate flexible computation of pI's from
amino acid sequence data, and the results were entered
into a relational database (Microsoft Access). A table of
spot positions and sequence-derived pI's and M_r's was
fitted with a large series of analytic equations using
TableCurve (Jandel Scientific), and the four conversion
Eqs. (1)-(4), relating computed pI and gel X coordinate,
or computed molecular weight and gel Y coordinate,
were selected, based on criteria of simplicity, goodness
of fit and favorable asymptotic behavior. Table 2 lists the
equations and coefficients. Application of Eqs. (3) and
(4) to a spot's X and Y coordinates, given in [1], produces
improved M_r estimates, and allows computation of pI

Correspondence: Dr. Leigh Anderson, Large Scale Biology Corporation,
9620 Medical Center Drive, Rockville, MD 20850[...], USA (Tel:
+301-424-5989; Fax: +301-762-[?]892; email: leigh@[...])

Keywords: [...] electrophoresis / Liver / Map / Identification / Calibration

© VCH Verlagsgesellschaft mbH, 69451 Weinheim, 1995



directly in pH units, instead of in terms of positions relative
to creatine phosphokinase (CPK) charge standards.
The inverse Eqs. (1) and (2) were used to compute the
gel positions of a series of pI and M_r tick marks. These
tick marks were plotted with SigmaPlot (Jandel)
together with fiducial marks locating several prominent
spots, and the resulting graphic was aligned over the synthetic
gel image (computed by Kepler from the master
gel pattern) using Freelance (Lotus Development). Maps
were printed as PostScript output from Freelance, either
in black and white (as shown here) or in color, where
label color indicates subcellular location (available from
the first author upon request). We have also used the rat
liver 2-D pattern as presented here to calibrate the patterns
of other samples. Using mixtures of rat liver and
mouse liver samples, for example, we made composite
2-D patterns that allow use of the rat pattern to standardize
both axes of the mouse pattern. This was accomplished
by deriving transformations relating the rat and
mouse X axes, and separately the rat and mouse Y axes
(Table 2, lower half; Fig. 2C and D), based on a series of
spots that coelectrophorese in these closely related species.
These functions were then applied to derive equations
relating the mouse liver X and Y to pI and SDS-M_r
(Eqs. 5 and 6 below). The resulting standardized 2-D pattern
for B6C3F1 mouse liver is shown in Fig. 3.

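$$M_{r,\ \text{MOUSE LIVER}} = f_{\text{RAT LIVER } Y \rightarrow M_r}\left(f_{\text{MOUSE LIVER } Y \rightarrow \text{RAT LIVER } Y}\left(Y_{\text{MOUSE LIVER}}\right)\right) \tag{5}$$

$$\mathrm{p}I_{\text{MOUSE LIVER}} = f_{\text{RAT LIVER } X \rightarrow \mathrm{p}I}\left(f_{\text{MOUSE LIVER } X \rightarrow \text{RAT LIVER } X}\left(X_{\text{MOUSE LIVER}}\right)\right) \tag{6}$$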
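
All of these calibrations start from pI values computed from amino acid sequence. A minimal sketch of such a computation follows: it sums Henderson-Hasselbalch charges over the ionizable groups and finds the pH of zero net charge by bisection. The pK values are generic textbook-style assumptions, not the set used by the authors (their pK table is not legible here), and the function names are hypothetical. Post-translational modifications are ignored.

```python
# Assumed side-chain and terminal pK values (illustrative only).
POSITIVE_PK = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PK = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERMINUS_PK = 8.6
C_TERMINUS_PK = 3.6


def net_charge(sequence, ph):
    """Net charge of a one-letter-code protein sequence at a given pH."""
    # Fraction of each basic group that is protonated (+1 when protonated).
    positive = 1.0 / (1.0 + 10 ** (ph - N_TERMINUS_PK))
    # Fraction of each acidic group that is deprotonated (-1 when ionized).
    negative = 1.0 / (1.0 + 10 ** (C_TERMINUS_PK - ph))
    for residue in sequence:
        if residue in POSITIVE_PK:
            positive += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PK[residue]))
        elif residue in NEGATIVE_PK:
            negative += 1.0 / (1.0 + 10 ** (NEGATIVE_PK[residue] - ph))
    return positive - negative


def isoelectric_point(sequence, low=0.0, high=14.0, tolerance=1e-4):
    """Bisect for the pH at which the net charge crosses zero."""
    # Net charge decreases monotonically with pH, so bisection is safe.
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0
```

A sequence-derived pI obtained this way is what Eq. (2) would map to an expected gel X position.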

A slightly more complex approach can be used to stand- 
ardize samples that have few or no spots co-electropho- 
resing with rat liver proteins. In this case, a 2-D gel is 
prepared with a mixture of the two samples, and four 
functions (forward and backward, each for X and Y) are 
derived relating each sample's own master pattern to the 
composite. The required functions are then applied in a 
nested fashion to yield the desired result (using rat 
plasma as an example): 



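$$M_{r,\ \text{RAT PLASMA}} = f_{\text{RAT LIVER } Y \rightarrow M_r}\left(f_{\text{RAT PLASMA + LIVER } Y \rightarrow \text{RAT LIVER } Y}\left(f_{\text{RAT PLASMA } Y \rightarrow \text{RAT PLASMA + LIVER } Y}\left(Y_{\text{RAT PLASMA}}\right)\right)\right) \tag{7}$$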

0173-0835/95/1010-1977 $5.00+.25/0



Electrophoresis 1995, 16, 1977-1981
Table 1. continued



2-D Database of rat liver proteins 1979



MSN a)



Protein ID b)   Protein name



Identification comments



Gel X c)   Experimental pI   Gel Y c)   Experimental M_r



1184. 1186, 
114. 174, 118 
5. 167. 157 
54.61 
136 



CPSM.RAT 



CATA.RAT 
C0X2.RAT 



Cartumyl phosphite 
synthase 

Caulase 

COX-II 



CYB5.RAT Cytochrome B5 



41 

29 

5. 11 

60 

27 

17 

196 

79 

62.78 
125 

307 



413, 1250. 
933 

133. 144. 235 
8. 23. 1307 
15, 25. 110 
971 

1216, 1215, 90 
256 

415, 734 

80 

227 

134 

18, 35, 226 

175, 251 
1168, 1170, 
1171 
47,93 

236 
320 

152 

1179, 1180. 
1181,1182, 
1183 
55, 103 

135 

172 

277, 56 
50, 1225 
1224 



CK-RAT' 

CK-RAT* 

ENPL-RAT* 

ENOA.RAT 

ER60.RAT 

ATPB.RAT 

ATP7.RAT 

F16P.RXT 

DHE3.RAT 
HAST- RAT' 

HOl.RAT 

HMCS.RAT 

HMCS.RAT 

HS7C.RAT 

P60.RAT 

HS70-RAT> 
HS90-RAT 1 
INGI-HUMAN 

LAMB-RAT*. 

LAMR-RAT* 

fabl.rat 
mdhc.mous 

E 



Cytokeratin 
Cytokeratin 
Endoplasmic 
Enolase A 
ER-60 

Fl ATPase 0 
Fl ATPase 6 

Fro nose* 1.6-bis-pbospbatase 

Gluumate dehydrogenase 
HAST-I: N-bydroxyaryl- 
amine sulfotransf erase 
Heme oxygenase 1 

HMG CoA synthase, 

cytosolic 
HMG CoA synthase. 

mitochondrial (frag) 
HSC-70 

HSP-60 

HSP-70 
HSP-90 

Interferon-v induced 
protein 

I Jfrii p B 

Taminin receptor 9 
L-FABP (liver fatty acid 

binding proteizi) 
Malate dehydrogenase 



58 968 

25 504 



2-D of pure protein; comfirmed by 145336 6.05 181.64 160 640 
^•terminal sequence and AAA 

Internal sequence 2000.81 6.73 499.64 

Ab (J. W. Taanman), confirmed by 45237 4.61 1062.67 
internal sequence 

2-D of pure protein; Ab; confirmed 515.68 4.73 1370.5* 18 493 
by AAA 

Location in cytoskeletal fraction H65.12 5.75 569.09 51 448 

Location in cytoskeietal fracuon 743.11 5.15 605.23 48 W 

Ab (F. Wiumann) 567.73 4>g3 2&3 J7 in m 

Internal sequence and AAA 1399.78 6.00 62334 46 674 

A-Tenninal sequence (R. M. Van Frank) 1184.20 5.77 52331 56.169 

^-Terminal sequence and AAA 629.06 4.95 588.S3 49 620 

Internal sequence 1227.24 5.82 1184.65 22 310 

Uncertain; by comparison wiib ID in 924.54 5.44 737.77 3 8 858 
Garrison and Wager (JBC 257:13135-13143) 

A'-TerminaJ sequence and interna] sequence 18S7J9 635 566.92 51 655 

Internal sequence 1297.94 5.89 86135 32 638 

Uncertain; available data from internal 121939 5.81 915.71 30 423 
sequence 

Ab (J, Germershausen) 1033.48 539 538.13 54 571 

Ab (J. Germershausen), ^-terminal 666.40 5.02 1019.42 26 811 

sequence (Steiner/Lotupeicb) 

Positional homology (with human, etc.) 811.87 5.2? 425.76 69 521 

through coelecirophoresis 

Ab (F. Wiuman); confirmed by yv-ierminal 845.09 5 J2 520.03 56 561 

sequence and AAA 

Ab(FWimnan) 976.11 531 437.14 67 674 

Ab (F. Wiuman) 659.86 5.00 329 90 107 

internal sequence 993.85 5 34 1006.04 27 237 

Positional homology with human through 737.10 5.14 425.19 69 615 

coelectrophoresis, nuclear location 

Internal sequence 534.02 4.77 697.62 

Ab (N. M. Bass) 1586.09 6.18 1483.43 

Internal sequence 1270.85 5.86 861.96 32 620 



4] 327 
16 £22 



Protein ID | Protein name | Identification comments | Gel X | Experimental pI | Gel Y | Experimental M_r
GR75-RAT* | Mitcon[...]; grp75 | Positional homology with human through coelectrophoresis | 905.67 | 5.41 | 413.67 | 71 589
NCPR_RAT | NADPH P450 reductase | 2-D of pure protein | 824.69 | 5.29 | 393.21 | 75 366
PDI_RAT | PDI: Protein disulfide isomerase | N-Terminal sequence (R. M. van Frank), Ab | 564.[?]0 | 4.83 | 528.47 | 55 618
ALBU_RAT | Pro-Albumin | Microsomal lumen location, pI, M_r relative to albumin | 1391.03 | 5.99 | 446.68 | 66 195
APA1_RAT | Pro-APO A-I lipoprotein | Coelectrophoresis with plasma protein | 920.41 | 5.43 | 1137.51 | 23 467
IPK1_BOVIN | Protein kinase C inhibitor 1 | Internal sequence; homology with bovine protein | 1480.01 | 6.08 | 1458.81 | 17 007
PNPH_MOUSE | Purine nucleoside phosphorylase | Internal sequence | 1507.19 | 6.10 | 911.16 | 30 599
PYVC-RAT* | Pyruvate carboxylase | Tentative; 2-D of pure protein (J. C. Henslee, JBC 1979); reported in Biochim. Biophys. Acta 1022, 115-125 | 1485.10 | 6.08 | 223.32 | 131 589
SM30_RAT | SMP-30: Senescence marker protein-30 | Internal sequence | 721.71 | 5.11 | 830.10 | 34 051
SODC_RAT | Superoxide dismutase | AAA; confirmed by internal sequence (R. M. Van Frank) | 116[...]4 | 5.74 | 1388.68 | 18 173
TPM-RAT* | Tm: tropomyosin | Location in cytoskeleton, 2-D position relative to human, Ab | 476.24 | 4.66 | 957.86 | 28 865
TBA1_RAT | Tubulin α | Positional homology with human through coelectrophoresis, cytoskeletal location | 688.22 | 5.06 | 537.67 | 54 620
TBB1_RAT | Tubulin β | Positional homology with human through coelectrophoresis, cytoskeletal location | 621.29 | 4.93 | 535.48 | 54 855
VIME_RAT | Vimentin | Positional homology with human through coelectrophoresis, cytoskeletal location | 673.00 | 5.03 | 539.30 | 54 426






[Figure 3: B6C3F1 mouse liver 2-D protein pattern]



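$$\mathrm{p}I_{\text{RAT PLASMA}} = f_{\text{RAT LIVER } X \rightarrow \mathrm{p}I}\left(f_{\text{RAT PLASMA + LIVER } X \rightarrow \text{RAT LIVER } X}\left(f_{\text{RAT PLASMA } X \rightarrow \text{RAT PLASMA + LIVER } X}\left(X_{\text{RAT PLASMA}}\right)\right)\right) \tag{8}$$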
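
The nested application in Eqs. (7) and (8) is ordinary function composition. The sketch below shows the pattern for the pI axis. The three linear maps and their coefficients are hypothetical placeholders chosen only so the example runs; the real transformations are the fitted functions the text attributes to Table 2.

```python
def compose(*functions):
    """Compose functions so that compose(f, g, h)(x) == f(g(h(x)))."""
    def composed(value):
        # Apply the innermost (last listed) function first.
        for function in reversed(functions):
            value = function(value)
        return value
    return composed


# Hypothetical placeholder calibrations (coefficients are invented).
def plasma_x_to_composite_x(x):
    """Rat plasma master pattern X -> plasma + liver composite X."""
    return 1.02 * x + 5.0


def composite_x_to_liver_x(x):
    """Plasma + liver composite X -> rat liver master pattern X."""
    return 0.98 * x - 3.0


def liver_x_to_pi(x):
    """Rat liver master pattern X -> pI (stand-in for Eq. 4)."""
    return 4.0 + 0.0014 * x


# Same nesting order as Eq. (8): outermost function listed first.
plasma_x_to_pi = compose(liver_x_to_pi,
                         composite_x_to_liver_x,
                         plasma_x_to_composite_x)
```

Swapping in the Y-axis functions gives the M_r version of Eq. (7), and dropping the middle step gives the two-level mouse liver case of Eqs. (5) and (6).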

This unified approach, in which one well-populated 2-D
pattern is used to standardize a family of other patterns,
has the additional advantage that the resulting pI and M_r
scales are directly compatible. Hence one can compare
the relative pI's of mouse and rat versions of a sequenced
protein in a consistent pI measurement system,
and select likely inter-species analogs based on positional
relationships on common scales. Adoption of
immobilized pH gradient (IPG) technology [4-7] will
result in substantial improvements in pI positional
reproducibility for standard 2-D maps such as those presented
here; however, we believe that our approach will
continue to be useful in establishing the empirical pH
gradient actually achieved by such gels under given
experimental conditions (temperature, urea concentration,
etc.), and in relating patterns run on different IPG
ranges and using different lots of IPG gels (between
which some variation will persist). Development of
rodent organ maps is a continuing effort in our laboratories
[8-10], and results in regular additions of identified
proteins. Those who wish to receive current rodent liver
maps, with color annotations, should send a stamped
self-addressed envelope to the first author.



We would like to thank the individuals who provided antibodies
mentioned in Table 1, and R. M. van Frank for un[...]

Introduction

«t lh: dtfimuon of all open readint „f Md '"""near future will 

JlSStt IS" s,mplt orfM,STO - 

are no, an end in ihenuleves. |„ fac , lhev onlv S a f,™° m ^""n* project 
m ? ihe f U nc,,on of an organism. A irea'rtdl™!^ = s « ,n S P°muo understand- 
co-«prc«,on of •housand s of gt „: s «^T£££ , '*!r ""t '** h h ° U ' ,h < 
^^^^^^^^^ 

The most promising nucleic acid-based [...] of mRNA
(Liang and Pardee, [...]; Bauer et al., [...]) [...]
an expressed gene or part of a gene. However, it is [...]
identify all cDNA species, and the approach does not easily allow a systematic
screening. Analysis of gene expression by the study of proteins present in a cell or
tissue presents a favorable alternative. This can be achieved by use of two-dimensional
(2-D) gel electrophoresis, qualitative computer image analysis, and protein identification
techniques to create 'reference maps' of all detectable proteins. Such reference
maps establish patterns of normal and abnormal gene expression in the organism and
allow the examination of some post-translational protein modifications which are
functionally important for many proteins. It is possible to screen proteins systematically
from reference maps to establish their identities.

To define protein-based gene expression analysis, the concept of the 'proteome'
was recently proposed (Wilkins et al., 1995; Wasinger et al., 1995). A proteome is the
entire PROTein complement expressed by a genOME, or by a cell or tissue type. The
concept of the proteome has some differences from that of the genome, as while there
is only one definitive genome of an organism, the proteome is an entity which can
change under different conditions, and can be dissimilar in different tissues of a single
organism. A proteome nevertheless remains a direct product of a genome. Interest-
ingly, the number of proteins in a proteome can exceed the number of genes present,
as protein products expressed by alternative gene splicing or with different post-
translational modifications are observed as separate molecules on a 2-D gel. As an
extrapolation of the concept of the genome project, a 'proteome project' is research
which seeks to identify and characterise the proteins present in a cell or tissue and
define their patterns of expression.

Proteome projects present challenges of a similar magnitude to that of genome
projects. Technically, the 2-D gel electrophoresis must be reproducible and of high
resolution, allowing the separation and detection of the thousands of proteins in a cell.
Low copy number proteins should be detectable. There should be computer gel image
analysis systems that can qualitatively and quantitatively catalog the electrophoretically
separated proteins, to form reference maps. A range of rapid and reliable techniques
must be available for the identification and characterisation of proteins. As a conse-
quence of a proteome project, protein databases must be assembled that contain
reference information about proteins; such databases must be linked to genomic
databases and protein reference maps. Databases should be widely accessible and easy
to use.

Recently, there have been many changes in the techniques and resources available
for the analysis of proteomes. It is the aim of this chapter to discuss the status of the
areas outlined above, and to review briefly the progress of some current proteome
projects.

Two-dimensional electrophoresis of proteomes 

Two-dimensional (2-D) gel electrophoresis involves the separation of proteins by their
isoelectric point in the first dimension, then separation according to molecular weight
by sodium dodecyl sulfate electrophoresis in the second dimension. Since first
described (Klose, 1975; O'Farrell, 1975; Scheele, 1975), it has become the method of
choice for the separation of complex mixtures of proteins, albeit with many modifica-
tions to the original techniques. 2-D electrophoresis forms the basis of proteome
projects through separating proteins by their size and charge (Hochstrasser et al.
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vital lo ollou comparison of gels from day i 0 dav and hefvwn res -ar-h t, 
factors can be difficult to achieve.

Carrier ampholytes are a common means of isoelectric focusing for the first
dimension of 2-D electrophoresis. Gels are usually focused to equilibrium to separate
proteins in the pI range 4 to 8, and run in a non-equilibrium mode (NEPHGE) to
separate proteins of higher pI (7 to 11) (O'Farrell, 1975; O'Farrell, Goodman and
O'Farrell, 1977). Unfortunately, the use of carrier ampholytes in the isoelectric
focusing procedure is susceptible to 'cathodic drift', whereby pH gradients established
by prefocusing of ampholytes slowly change with time (Righetti and Drysdale, 1973).
Carrier ampholyte pH gradients are also distorted by high salt concentration of
samples (Bjellqvist et al., 1982), and by high protein load (O'Farrell, 1975). A further
limitation is that isoelectric focusing gels, which are cast and subject to electrophore-
sis in narrow glass tubes, need to be extruded by mechanical means before application
to the second dimension - a procedure that potentially distorts the gel. Nevertheless,
many of the above shortcomings can be avoided by loading small amounts of 14C or
35S radiolabelled samples (Garrels, 1989; Neidhardt et al., 1989; Vandekerckhove et al.,
1990). High sensitivity detection is then achieved through use of fluorography or
phosphorimaging plates (Bonner and Laskey, 1974; Johnston, Pickett and Barker,
1990; Patterson and Latter, 1993). However, this approach is only practicable for
organisms or tissues that can be radiolabelled.

An alternative technique, which is becoming the method of choice for the first
dimension separation of proteins, involves isoelectric focusing in immobilized pH
gradient (IPG) gels (Bjellqvist et al., 1982; Gorg, Postel and Gunther, 1988; Righetti,
1990). Immobilized pH gradients are formed by the covalent coupling of the pH
gradient into an acrylamide matrix, creating a gradient that is completely stable with
time. IPG gels are usually poured onto a stiff backing film, which is mechanically
strong and provides easy gel handling (Ostergren, Eriksson and Bjellqvist, 1988). The
major advantages of IPG separations are that they do not suffer from cathodic drift,
they allow focusing of basic and very acidic proteins to equilibrium, pH gradients can
be precisely tailored (linear, stepwise, sigmoidal), and that separations over a very

?;:," nV ,L/ an c" POVSiWe ,0 W5 PH Uni,> PCf Cm » ,Ri i- hc,,i - Bicllqvis, „ ai 

• T' d Smh:i ly9U;C ° r ^'« / " "** Gelfi cn,L l^LL^, !,' 
1988). However, it is not currently possible to use IPG gels to separate very basic
proteins of isoelectric point greater than 10, although this is under development.
Narrow pH range separations are useful to address problems of protein co-migration
in complex samples, allowing 'zooming in' on regions of a gel. IPG gel
strips are now commercially available, which begin to address the problem of intra-
and inter-lab isoelectric focusing reproducibility.
There are two means of electrophoresis for the second dimension separation of
proteins: vertical slab gels and horizontal ultrathin gels (Gorg, Postel and Gunther,
1988). Both are usually SDS-containing gradient gels of approximately 11% to 15%
acrylamide, which separate proteins in the molecular mass range of 10 - 150 kD. A
stacking gel is not usually used with slab gels, but is necessary when using horizontal
gel setups (Gorg, Postel and Gunther, 1988). Comparisons have shown that there is
little or no difference in the reproducibility of electrophoresis using either approach

(Corbet, c, aL 1 994a,. but commercially avaUable vertical or iK^inJ^S^ 

will provide greater reproducibility for occasional users. For slab eel elcctropho e^i 
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: rro.cn. cmmon iu each cel. (A . W,d e pi ran.e m„ tim^Tul J T R,npv hl ■ l:h,, • ^h, 

protein* F.rs. d.mensmn se'nnrauon 'arhe "5 uMncTn ?Z k i ! m3r hum:,n 

Tnc «™d d.men„nn »,* SDS-PACE Anual ceU, J Z ^T t PH PiM,,em of ' 5 M » 0 un »> 

r^mamap T h ehr, ldime n.^ Ch. 

wnnd dimension ui.t 5DS-PAGE Mirmnr^ri..;.. i i P *- rjUlcni ° r -~ ln — and 

catiljs s>s, OT i has been shown » gjve battr resoluiion and hither «„<mvj,! 
d«n» ,Hoch«rasse, and Merri.. mt. Hochs,,**.,. Pa-cho^nd M^L 
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Notwithstanding tht advances described air \* there is jn inrr ■ 
improve the reproducibility of :-D electrophoresis* faciJi-ate da^L d '° 
and proteome studtes. Ha^ton „ al. „99J, explain ^iSlSriS^SS 
protem spots, and there ,, 99.5* spot matchtn* from pel to & lh f< lviIf pro ™ 
spot errors per gel. This amount of error, which might accumulate with each gel-to-gel
comparison used in database construction, could produce an unacceptable level of
uncertainty in gel databases. To address these issues, partial automation of
gel separation, has been unoenaken (Nokihara. Moritaand Kuriki I ' 
a al.. 1 993 ». Althoueh result are preliminary ,r», Vn .™ „ , Hamn - 8,on 

in one studs was found to be threefold T'Tl Te * T0 *» ciM «> 

„/.. 1993,. 1, should be noted that JKt^f^^]o^ M ^ S ^ r, 
i . - • ^ S CI lormau :M) x 4? mm) have h»»* 

almost completely automated (Brewer et al., 1986), although these are not generally
used for database studies.
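
The effect of a fixed gel-to-gel matching rate can be illustrated with simple arithmetic. The sketch below is only an illustration: the figure of 4000 spots per gel is an assumed value chosen for the example rather than one taken from the cited study, the function names are hypothetical, and mismatches are assumed to be independent from one comparison to the next.

```python
def expected_mismatches(spots_per_gel: int, match_rate: float) -> float:
    """Expected number of wrongly matched spots in one gel-to-gel comparison."""
    if not 0.0 <= match_rate <= 1.0:
        raise ValueError("match_rate must lie between 0 and 1")
    return spots_per_gel * (1.0 - match_rate)


def cumulative_match_fraction(match_rate: float, comparisons: int) -> float:
    """Fraction of spots still correctly tracked after a chain of comparisons.

    Assumes each comparison fails independently with the same probability.
    """
    if comparisons < 0:
        raise ValueError("comparisons must be non-negative")
    return match_rate ** comparisons


if __name__ == "__main__":
    # Assumed example: 4000 spots per gel, 99.5% matching per comparison.
    print(expected_mismatches(4000, 0.995))
    print(cumulative_match_fraction(0.995, 10))
```

Under these assumptions even a high per-comparison matching rate leaves a noticeable number of mismatched spots once many gels are chained together, which is the concern raised above for database construction.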

MICROPREPARATIVE 2-D GEL ELECTROPHORESIS

With the advent of affordable protein microcharacterisation techniques including * 
terminal »'™eque^ 

analyst* and monosaccharide compos.tional analysis, a new challenge for^b 3^ 
phoresis has been to maintain high resolution and reproducibil.lv but ,o IZ T 
P--insuff,,entou^ 

quantities of protein per spot). This becomes difficult to achieve with very complex
samples such as whole bacterial cells, as the initial protein load is divided among
up to 4000 protein species. Two approaches are used for producing amounts of protein
that can be chemically characterised. The first method is to run multiple gels
and pool the spots of interest, and subject them to concentration (Ji et al., 1994), which
can also act as a purification step to remove accumulated electrophoretic contaminants
such as glycine. A more elegant approach has been to exploit the high loading capacity
of IPG isoelectric focusing. The hish loading caDacnv of i mmft u' , VI* Cjpjut > 
was - escr* ed early ,Ek. BjeHqv.^nd 

mg of protein can be applied to a single gel, yielding microgram quantities of hun-
dreds of protein species. A further benefit of this approach is that proteins of
low abundance^h.ch may no, be v,ual,ed by uZr inW^n^ch" 
.o be detected. The use of electrophoret.c or chromatographic nre fractional -I? 
mques , Hochstrasser « uL 1 99 1 a: Harrm-ton et al I 9<T \ I followld hv I "7 t 
^™nge^^^^^ 

studies on proteins present in low abundance. * ,( 

Methods of protein detection 

There are many means for detecting proteins from 2-D gels. The method used will be
dictated by factors including protein load on gel (analytical or preparative), the

tiTl° b ' ' f0r Pr ° ,C,n qUamila,i ° n ° f f0f b,0ttin * "™» eh^K d^rtS 
•on,, and the sensmvnv required. The most common means 0 f pr0lein de ,«U 0 " „d 

their applications are shown in Table 1. Most detection methods have
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example some glycoproteins are not sunned by coomass.e blue (Guldhcrs „ „/ 
k T/ 7 3n,C d> ' eS ^ UnSUhab,e f0r P ro,e,n dc,cc, » on o» P V 'DF Samples 

Although most means of protein detection give some indication of the quantities of
protein present, in general they cannot be used for global quantitation. This is because
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no proicii. <ta.n is able con.-.istently 10 detect proteins over a rjn8c of . nn .. 
.ions, isoelectric pom.c and amino acid composition., and wif„ of 

.here art lurse d.f.erences , n statning patten, when identical celsor bio sareTuh^' 
i« differs <«a,ns. mcluding amido black. imidaioJe ml i^tl^^V 
co.lo.da, gold, or coomav.e blue tTovev. Ford and Baldo. %S 
i-he rr.o .. common means of quantitative large number, of protein ,n 7* n ' " 
involve, th- rudiolabcll.ng of protein sample* prior to el JLT "* D 
quanrta.ion based on fluoroeruphv and ima- JSZl o, e ^ lT ° phoK ^ »nd protein 

meth.on ,e canno, be detected ifon^S^^^ ^ 
BLOTTING OF PROTEINS TO MEMBRANES 

Electrophoretic blotting of proteins from two-dimensional polyacrylamide gels to
membranes present manv options for protein idemifir...;^ u, -. Jcr > eels to 
which are no, possible whence.™ SttSESTT*"^ 

r polyvi " v " d£nt dmuoride ^-J^^Sbkk? 

letmuul sequences, amino acid analysis, or immunoblouin- or ta»ST i 
.o endoproiemase d,«Mion. monosaccharide analvsis pho^l an^ . 't" 
mnirix.a.srs.ed laser desorprion ionisaiion mass ^c,™^ ^ ' ^ 
Uilkmsw,,/.. l99;:JunsWut<-«i/.. 1994: Sunoncr,,/ 199V R,<L, t - 

pos-.ble ,o comhrne of some of .hese procedures on a sintlc pr„,ei„ spo „„ a'pvDP 
membrane .Packer ,, „/.. I 995: Wilk.ns „ „/.. suhmmedAvcUandlc^, l^f 
Thi. is useful when minimal amounls of prolcin are iv.ii.M. r ■ ' Wl • 

■echn.ques will he explored ,n de,a,l la.er J h « Z Z w h SlT ' T' 
■here are some disadsamases associated «i,h bTol In* o 1 * "* ^ 
There is always loss „f sample durin* hlol.. 1c d "s . Ecfc T*"^ 
IV9JL and common P ro,ein deiecon m*£^ £^J?j£^ 

2-D gel analysis, documentation, and proteome databases

Following protein electrophoresis and detection detailed m-.K ; «r . 
undertaken*,,,, computer system, For proteome -t^T^^TT^ 
to catalogue all < P o,s from the 2-D eel in a qualiLile a^nd f nlih 1 ^ * 
man„e, so as to define the number of prote^ 

Reference gel images, constructed from one or more eels form th, h P T™' 
d.mcns,ona, ge, databases. The, databases a,o JL£^£ 
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(Nfftdhardt „ « StmpsonT^ ~ 7^^- **~ 
database, contarntnj DSA sequence data. chromoLa, £ o.4 ' 
D f els and protem funconal mfc™,,™ f„ r „ or?anisn , JJ hecomt^H J*' 
a' jenome and proteome pro.ects propess ( VanBotelen « ,,/ 1 09™ v tMJ ™ ,shed 
Database cited inGarrelse;,,/.. 1904,; •"«"»"«» "<"•• 199.. Wt 
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nhn/nh ^^P" 0 '"" * nd by st:unin £ . fluorosraDhv „r 

phosphonmagtng. images of gels are digitised for computer an^lvs " hv P ' 
scanner. laser densuomer. or charse-coupled device (TCD) LmeA. r ? ' ma?C 
Celis a aL 1990a: Uru tn and Jackson I »T ? ,7 , ,CaffCk ,989: 
re S o, Ut io„ofl00-200mm.a„dca„de^^ 

P-I.rn.onv to remove ven.cal and honzomal streaking J 
spot posmon; and boundaries, and to calculate spot .ntenvi.v , r , dC,CC ' 
<po. ,SSP. number, contammg v cn ,cal and l«5^fc J f U ^ 
assigned .o each detected spot and becomes the proZ 2* ,nf ° rrTUi,,on - * 
l«« -me notable son usages whTch Jo^ D ^1 ^ 2 



Table 2: Some Software Packages for the Analysis of Gel Images.
Gel Image Analysis System    References
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Andean „„/ R, chaflKnn . H „rn ;md Anders. 



^ • h-sr referent arc nn, f ,na usmf . lhf) )n ., ucjf ^ ^ ^ ^ ^ ^ 



As there arc difficulties ,n the electrophoresis of samnl-, „„h i(m<- „„ , 
>i>. reference itl i KIS are often construe,,,! f JL. , _T , reproducibil. 

:0OO ,o 4000 protetns from one £ ,l J^^^Z *T """^ « 

n-anuall, des, ? „ a ,e, appro,. m a,eh 50 o, so !pZ 1 Z ^ ^ 

he cross-maiched. Proiems which match are ITw V ■ ljniimarks ° n 
»'tat computer-based vector «^^^^^? ed ^ lMlhna ' k - 
Close ,o 100* of spots f rom comple^a^ ^ * IZZT ? 
alihoush different decrees of ootrato, in.lr!. ^ d bv lhtst ""'hods. 






Marc R. Wilkins et al.




JenMrnmcter. (BiCei .mace after Dro««,nC^ ™" * V ^Z^' pC ' ,ma * e as ca ' Mure ' , "> «r 
ol all s P0 ,s on ,h e £| " P 10 m " on W " k,B ' a " d ground. .C» Ou.l.ne dcfim.mn 



Progress with proteome projects

CALCULATION OF PROTEIN ISOELECTRIC POINT AND MOLECULAR WEIGHT

Estimation of the isoelectric point (pI) and molecular weight (MW) of proteins from
2-D gels provides fundamental parameters for each protein, which are also of use
during identification procedures (see following section). The pI and MW of proteins
are recorded in 2-D gel databases. Accurate estimations of protein pI and MW can be
obtained by using 20 or more known proteins on a reference map to construct standard
curves of pI and molecular weight, which are then used to calculate estimated pI and
MW of unknown proteins (Neidhardt et al., 1989; Garrels and Franza, 1989; Van-
Bogelen, Hutton and Neidhardt, 1990; Anderson and Anderson, 1991; Anderson et
al., 1991; Latham et al., 1992). Alternatively, the MW of individual proteins blotted
to PVDF can be determined very accurately by direct mass spectrometry (Eckerskorn
et al., 1992). Where immobilised pH gradients are used, the focusing position of
proteins allows their pI to be measured within 0.15 units of that calculated from the
amino acid sequence (Bjellqvist et al., 1993c). It must be noted, however, that proteins
carrying post-translational modifications may migrate to unexpected pI or MW
positions during electrophoresis (Packer et al., 1995).
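
The standard-curve approach described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of any cited group: it assumes pI varies piecewise-linearly with horizontal gel position and log10(MW) with vertical position, that the known proteins have distinct positions, and all function names and coordinates are hypothetical.

```python
import math
from bisect import bisect_left


def interpolate(x, xs, ys):
    """Piecewise-linear interpolation on strictly increasing xs.

    Points outside the calibrated range are extrapolated from the end segment.
    """
    if len(xs) < 2:
        raise ValueError("at least two standards are required")
    i = bisect_left(xs, x)
    i = min(max(i, 1), len(xs) - 1)  # clamp to a valid segment index
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def estimate_pi_mw(spot_x, spot_y, standards):
    """Estimate (pI, MW) for a spot from known proteins on the same map.

    standards: list of (x_mm, y_mm, pI, MW) tuples for identified proteins.
    """
    by_x = sorted(standards, key=lambda s: s[0])
    by_y = sorted(standards, key=lambda s: s[1])
    pi = interpolate(spot_x, [s[0] for s in by_x], [s[2] for s in by_x])
    log_mw = interpolate(
        spot_y, [s[1] for s in by_y], [math.log10(s[3]) for s in by_y]
    )
    return pi, 10 ** log_mw


if __name__ == "__main__":
    # Two hypothetical standards; a real map would use 20 or more.
    standards = [(0.0, 0.0, 4.0, 100000.0), (100.0, 100.0, 8.0, 10000.0)]
    print(estimate_pi_mw(50.0, 50.0, standards))
```

Interpolating log10(MW) rather than MW itself reflects the roughly logarithmic relationship between migration distance and molecular weight in SDS gels; other curve forms could be substituted without changing the overall scheme.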



SPOT QUANTITATION AND EXPRESSION ANALYSIS 

A major challenge faced in proteome projects is the quantitative anah sis of proteins 

,?,T ' •? Sle ™* h ™^ The ™« ™»™ ^ans of protein quantisation is 
to determine chemically the amount of each protein present by amino acid com-
positional analysis. However, the current method of choice for quantitative analysis
of many proteins is to radiolabel samples with [35S]methionine or 14C amino acids,
perform the 2-D electrophoresis, and measure protein levels in disintegrations per
minute (dpm) or units of optical density. Quantitation is achieved either by liquid
scintillation counting, or by gel image analysis where spot densities are quantitated
by reference to gel calibration strips containing known amounts of radiolabelled
protein or against the integrated optical density of all spots visualised (Vandekerckhove
et al., 1990; Celis et al., 1990b; Celis and Olsen, 1994; Garrels, 1989; Latham,
Garrels and Solter, 1993; Fey et al., 1994). All approaches in effect allow spots to
be normalised against the total disintegrations per minute loaded onto the gel.
Limitations that remain with radiolabelling methods are that absolute quantitation is
not achieved because all proteins have varying amounts of any amino acid, and that
only easily labelled samples can be investigated. Quantitative silver staining presents
an alternative (Giometti et al., 1991; Harrington et al., 1992; Rodriguez et al., 1993;
Myrick et al., 1993), which when undertaken with [35S]thiourea (Wallace and Saluz,
1992a, b) is of extremely high sensitivity.
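
Normalisation against the total signal on a gel, as described above, can be sketched as follows. This is an illustrative sketch only: the function names and spot identifiers are hypothetical, and it assumes that each spot has already been reduced to a single integrated density (or dpm) value.

```python
def normalise_spots(spot_signal):
    """Express each spot as a fraction of the total signal on its gel.

    spot_signal: dict mapping a spot identifier to integrated density or dpm.
    """
    total = sum(spot_signal.values())
    if total <= 0:
        raise ValueError("total signal must be positive")
    return {spot: value / total for spot, value in spot_signal.items()}


def fold_change(control, treated, spot):
    """Ratio of normalised signal for one matched spot between two gels."""
    return normalise_spots(treated)[spot] / normalise_spots(control)[spot]


if __name__ == "__main__":
    control = {"spot_a": 30.0, "spot_b": 70.0}  # hypothetical values
    treated = {"spot_a": 120.0, "spot_b": 80.0}
    print(normalise_spots(control))
    print(fold_change(control, treated, "spot_a"))
```

Because both gels are scaled to their own totals, the comparison is relative rather than absolute, which matches the limitation noted above for radiolabelling methods.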

When protein spots from samples prepared under different conditions are quantitated
and matched from gel to gel, it becomes possible to examine changes and patterns in
protein expression. Large scale investigation of up- and down-regulation of proteins,
and of their appearance and disappearance, can be undertaken. For example, simian virus 40
transformed human keratinocytes were shown to have 177 up-regulated and 58 down-
regulated proteins compared to normal keratinocytes (Celis and Olsen, 1994); detailed
synthesis profiles of 1200 proteins have been established in 1 to 4 cell mouse embryos
(Latham et al., 1991, 1992); and 4 proteins out of 1971 were found to be markers for






cadmium toxicity in urinary proteins (Myrick et al., 1993;

anc' P. Mck -Larsen. Phonal communication,. Impresshelv. laree «el ii^S 
pro.ein expre»,on under different condition* can be sJobalh W<t, Wus£ 
" " " W of related objects wilhin a For e* ^ £e 
REF.2 ra, c,ll hne database, consisting of 79 5 eis from 12 experimental *roup! vh re 
each eel contains quantitative data for 1 600 cro«.mairh#d nr«,^ k Z P 

rT . vere induced or repressed ,ni ^^JJ^^ 
tran.format.on. su^tinc a common mechanism Protein crounc.K Jdcnpv,ru> 
orrepr^edJurmcculturesrouihto^^ 

po.ent.al ^investigation of cellular control nidj^ii^*; 
immense, h » equally clear that inventions of cene expression of tnt ? 
currently .cchnicaily impossible usi„ S nucleicacid baTed fSS^ " 

Table 3: Some proteome databases and their special features

Proteome database                   Special features

E. coli gene-protein database       Gel spots linked with GenBank and Kohara clones;
                                    quantitative spot measurements under different
                                    growth conditions
Human heart databases               Identification of disease markers; two separate
                                    databases have been established
Human keratinocyte database         Extensive identifications; quantitative spot
                                    measurement of transformed cells; identification
                                    of disease markers
Mouse embryo database               Quantitative spot measurement through 1 to 4
                                    cell stage
Mouse liver database (Argonne       Documents changes due to exposure to ionizing
Protein Mapping Group)              radiation and toxic chemicals
Rat liver epithelial database       Detailed subcellular fractionation studies
Rat liver database                  Extensive studies on regulation of proteins by
                                    drugs and toxic agents
REF52 rat cell line database        Accessible via World Wide Web; quantitative spot
                                    measurements under different conditions
SWISS-2DPAGE containing             Accessible via World Wide Web; completely
human reference maps                integrated with SWISS-PROT and SWISS-3DIMAGE
Yeast Protein Database (YPD) and    Completely cross-referenced organism database;
Yeast Electrophoretic Protein       YPD has extensive information on over 3500
Database (YEPD)                     proteins; YEPD has many identifications
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FEATURES OF PROTEOME DATABASES 






Proteome projects rely heavily on computer databases to store information about all
proteins expressed by an organism. 'Proteome databases' should contain detailed
information on proteins already characterised elsewhere, as well as protein data from
2-D gels such as apparent pI and MW, expression level under different conditions,
subcellular localisation, and information on post-translational modifications. Images
of reference 2-D gels, listing protein SSP numbers and protein identifications,
should also be included. Ideally, proteome databases should be accessible with
Macintosh or IBM personal computers and easy to use. Some proteome databases and
the areas they cover are listed in Table 3. Databases range from collections of
annotated gels to large databases of images integrated with protein and nucleic acid
sequence banks.

One example of an integrated proteome database is the suite of SWISS-PROT,
SWISS.2DPAGE and SWISS-3DIMAGE databases 1 Appel,,,,/.. ,99V Appel r 
199.: Appel. Ba.roch and Hochstrasser. 1994; Bairoch and Boeckman n : 19Q4 , The 
features of these three databases are listed in Table 4. SWISS-PROT, SWISS-
2DPAGE and SWISS-3DIMAGE are accessible through the World Wide Web

Table 4: The SWISS-PROT, SWISS-2DPAGE and SWISS-3DIMAGE suite of databases

SWISS-PROT
  Information:           Text entries of sequence data; citation information;
                         taxonomic data. 38,303 entries in Release 29
  Annotations:           Protein function; post-translational modifications;
                         domains; secondary structure; quaternary structure;
                         diseases associated with protein; sequence conflicts
  Referenced databases:  SWISS-2DPAGE, SWISS-3DIMAGE, EMBL, PIR, PDB, OMIM,
                         PROSITE, Medline, Flybase, GCRDb, MaizeDB, WormPep,
                         DictyDB
  Other features:        Navigation to other SWISS databases achieved by
                         selecting entries with computer mouse

SWISS-2DPAGE
  Information:           2-D gel images of: human liver, plasma, HepG2, HepG2
                         secreted proteins, red blood cell, lymphoma,
                         cerebrospinal fluid, macrophage-like cell line,
                         erythroleukemia cell, platelet
  Annotations:           Gel images where protein is found; how protein
                         identified; protein pI and MW; protein number; normal
                         and pathological variants
  Referenced databases:  SWISS-PROT and all other databases accessible through
                         SWISS-PROT
  Other features:        Gel images show position of identified proteins, or
                         region of gel where protein should appear

SWISS-3DIMAGE
  Information:           Collection of 330 3-D images of proteins
  Annotations:           All information is available in SWISS-PROT
  Referenced databases:  SWISS-PROT and all other databases accessible through
                         SWISS-PROT
  Other features:        Mono and stereo images available. Images can be
                         transferred to local computer image viewing programs


< BerneivLee e i aL 1 992 1. aJlou jne an v 

the ,,orcd tnformation and inu^N^S^^ * iWM ™ » 

i< a< all postal ai^T^S^/^^ ******** 

car ^e,ec lsd u-n h aco mputcr ^a^^~ ^ - d 

,nu „ can be viewed if k „ou^. £ ff£ S^J^ « • «<-ce „, 
available. References to nucleic acid and other databases are given to provide
access to information stored elsewhere.

Organism' databases, containing detailed nroi#.in 
ur». yI a spec.es. are becoming common T^Z^ K 

Tl«^diffcrfrom nucleic arid orprweinwnle^.^ P T" K pr ° jeC " 

PROT because thev are ima Ee ^r^^^^ h ^ C,eMotS ^ 

map posittons. transcription' of 'enes 1 ' nforma «' on ^ou, chromo^oma, 
c WA* «,/, gene-protein dataS mV»bSS u*™™ ^ *" 

VanBo,ele» and Neidhardt. 1991 VanS T Neidhar *- ,99 <>: 
EC02 DBASE, is one example l,LtaTn!Xn e and' <,/ "- l992L kBW " * lh < 
-nformation .including P l and MW « T"? 2 * D ^ «l« 

mation (GenBank or EMBL codes. chnJS^ ,de " ur ' CJl ' on '- *eneiic infer- 
• Kohara. Akiyama. and Isono. 1 987, 2 c,S Dn H ^ '"T 0 " Koh ^'-e, < 
regulatory information (level of pLei ^2^^?^ * enesK and 
member of region or st.mu.onf AH ^ Tm 

referenced to the SWISS-PROT database (Bairnrk ™; DBASE ar * al*o cros.. 
anticipated t ha, onanism database ^Zll^t ' ,994 '- ,l " 

available informal about a plnicl^ 

consistent manner in which organism databases are assembled, which may hamper
comparisons in the future.

Identification and characterisation of proteins from 2-D gels

The number of proteins identified on a 2-D reference mun, i.,., 

a research and reference tool As most referent u ? dClCrm,nev ,,s "*fulne>. as 

protens identified, a ma.or a m of a In, ' °" " " ^ Pr ° P ° nit,n " f 

fmm 2-D maps. ,n order to deAn Z ZZ " '° 

databa.es. or as unknown ^ ^ CUr ™ ™*ic and and P ro,c,n 
open reading frames, and prov.des 0 ^ f D "a £'1'" ™ finnSU,l,n DNA 
characterisation efforts K pointinc to nmLn li,, ^ f a " d prnie,n 

3000-4000 proteins from a Li ""d man thT ,1 "! "'I?' ^ ,hcre ™> bt 
pmtein screening ,s , 0 id«j£J£^ «* in 

Traditionally, pro.e.n. from ^-D eels havn h ■ , T'™" 1 ° f COM and cfft '"- 
-munob,ot„n ? P N.termmT, ~ ^ ^ 

com,gration of unknown proteins u,,h Inown X ems 0 K SCqUCndn - C - 
homolosouscenes of interest in ih^nr„ « . lcins - or h > overexpresxion of 

««/.. 19941. Whita lhesc , tchni Q ^ ^ " ^'"'j, , 9 ' J ' "?««««'.. IW3:'CancN 

^ l0 mas5 pro , tin , Je „ liricaii : n :x; 



Title 5: Hierar?h.ral analyse far mas; screemnc *f ;.D waia^d m 
exrtnsnr ^ a* nrn u,ed ,f nc^sa, TaNr mndL?^^^ ^ 

Order 



Amino acid analysis with N-terminal sequence tag
Peptide-mass fingerprinting

Combination of amino acid analysis and peptide
mass fingerprinting

Mass spectrometry sequence tag

Extensive N-terminal Edman microsequencing

Internal peptide Edman microsequencing

Microsequencing by mass spectrometry (electro-
spray ionisation, post-source decay MALDI-TOF)
Ladder sequencing



Jimplui ciuL I wo;. Slum |wg; 
Hc»N»iim. Houitueu- anj Sander. )™i 

W,lk, n * a «/.. ^ghmmed 

£"^'7'' ' » W. *TPm. H.Mr„ r ;fnJ 

Mann. Ho,ru r and R-cr^uirit 

^ mv.VMnn,,.,,,; |yi ij 

•Sutton n ol.. IW5 

Corduell r/o/., 1995, 

Mann and \Vil m . ivvj 
Matsudatra. IVK7 

Rmcnfeld rt aL ivv; ; 
Hcllman a aL lvy.v 
John.ton and WjUh. 

Banlei-Joncv r/ |ggj 



alternative to traditional approaches iTublc 5: W a sj ns er«<,/ ioo<, T k- • . 

use of rapid and cheap identification tools such as am.no u c d a n a «! '"T^* ^ 

mass fin ? erpnnt,n ? as fi r « ste ps in protein idenUf^^ 

slower, more expensive and time consumine idemiHcation oroc-dn J ' ° f 

ihe construction of this hierarchy the analv time cos, Z S ' f k ncce * sar > • ,n 

of the dat. created has been considered a u hi T^t^ 

machme „me per sample, the c^^T^fe ^ m < 

con<„min F . Ammo acid analvsK and peptide ma sJl^T I ^ imd ,ime 

•cchniques in the hierarchv a're disJSt^^T^ ^ 

•denttneatton technics ,n TMr , see Pat tertoi *\ 994 1 an dl lan n nw?" ^ 

PROTEIN IDENTIFICATION BY AMINO ACID COMPOSITION 
There ha> been a revival of interest in ih* «r 

■demifenor, of pro, t ,r, from 2 d1 f, r J,v C0 '" P ° Ni " 0n '"' 

The amino acid composition of proteins can k» ,u L_ ^ . ' " n t,JUh *>cs. 
rad,oH W „„, Md 0 l, iuIlvc ZZZ^^g^T"? r U,b0 ' iC 
«/.. 1994 Frev <v n/ I0Q4, rtr k, -m i. j . " elCttro PhureM^ «Garrcls £ -/ 

ch-.-w™^ i, 0 ^; ^ c ; u ,t: m ;r/c": en,hranc - b i= o,,tjpr,, ' c,n - nj 
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A»r.: 12 .5 Z'.x : Ser : 5.7 wj g: f 7 

My: 5 4 Thr: 2.6 Al*: 6.7 Pro: 

T>T: 1.3 ATS: 5.0 Val: E.O M«i: C.3 

5.9 6.0 Ph«: 13.3 ty, . « 4 

pi «*tiia«te: £.59 R«no« searched: I 6.64, 7.J4) 
«w as^sa:*: i«8:c R*not tcarcacd: (13440. 20160} 

Clstes: *w:s=-??.rr e.-.ine* « 3 - tht special Ercil matched Sy aa s s'- en- 
Rank Sccrt Prstein pi Mw l»*sripsion 

1 34 PTK.ECOU CM „,„ A»«TA^"cAJUUXCm.TJUU«rriU« 

; i! 5r^::r: f 36359 ***wtkwati kinase iec iT^s, 

« rA=r.rr=i: 5 .S2 57 8 i2 TNuscRxmcrca activator cl£^ 
s « H^rr_r-i: g.ss 19759 hekolysxn c. flasSd. 

rL^"" SVf:S£ ?R=r er ~ iM i8r vitk «* value, in .p.-ji.a 

Ran* Srrre »r::ein pi Mw peasription 

J " !!^-!"rf « V 1 "" "'^^"cMuiuonrxTiiAnsrauLK 

2 :s2 TRjs.rrri: 6.73 17921 thaj protein. 

3 :i2 YA.-;.rrr.: 6.79 1902a kypotheticai lipoprotein yajs 

< 140 YrJB.-D:: 6.B3 149« hypothetical 14.9 u> protein in CRPE 
5 142 YAKAECO-I 7.06 14726 HYPOTHETICAL PROTEIN INBETT VAS3ION 

Ficure 4. Computer printout Iron, ExPASv server where the empirical am.no acid co mp ,K,.„ m 

SWISS-PROT for E. coli. The correct identification, aspartate carbamoyltransferase, is shown in bold. Low
scores indicate a good match. Note how matching within a defined pI and MW range (lower set of
rankings) increased the score difference between the first and second ranking proteins. This large score
difference gives high confidence in the identification, and is only observed when the top ranking protein
is the correct identification (Wilkins et al., 1995).

graphy-based analysis. Proteins blotted to PVDF membranes can be hydrolysed in 1 h
at 155°C, amino acids extracted in a single brief step, and each sample automatically
derivatised and separated by chromatography in under 40 minutes (Wilkins et al.,
1995; Ou et al., 1995). In this manner, one operator can routinely analyse 100 proteins
per week on one HPLC unit. This technology lends itself to automation, and it is
anticipated that instruments with even greater sample throughput will be developed.
When proteins have been prepared by micropreparative 2-D electrophoresis (Hanash
et al., 1991; Bjellqvist et al., 1993b), blotted to a PVDF membrane and stained with
amido black, any visible protein spot is of sufficient quantity for amino acid analysis
(Cordwell et al., 1995; Wasinger et al., 1995; Wilkins et al., 1995).

After the amino acid composition of a protein has been determined, computer
programs are used to match it against the calculated compositions of proteins in
databases (Eckerskorn et al., 1988; Sibbald, Sommerfeldt and Argos, 1991; Jungblut
et al., 1992; Shaw, 1993; Hobohm, Houthaeve and Sander, 1994; Wilkins et al.,
1995). Matching is usually done with only 15 or 16 amino acids, as cysteine and
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Rank 


Sccrt 


Prcteia 


pi 




a 


21 




8.03 


45318 


2 


22 




5.86 


3C502 


3 


38 




5.78 


45774 


4 


44 




S.8€ 


48018 




45 


DHS4_SZ£12 


5.98 


46581 


6 


46 




5.79 


43765 




46 




5.78 


37851 


e 


47 




5.98 


49162 


9 


4 * 




5.85 


43290 


:c 


50 




€.01 


370€4 



Atx: 5.* CIx: 10. 8 Ser. 4.1 Hit: 2.7 

Giy: 12.2 %r: 2.8 Xla; 11.9 Pro: 3.2 

T-/r: €.f AT3: 3.7 Val : 9.5 Mat; C.« 

5.1 L«u: 8.2 Ph«: 3.2 Ly»: 4.9 

p: astir^t*: 5.99 fcangt starched: ( 5.74, g 34) 
Mv t«un:t: 45000 Aar.?t ••arched: (3«OOo! 54000) 

Cl«.« TCSS-fk- fee EC^: w iih pl and val.„ 4a ^ Si . lrt 

KJX2X 
K S K S X 
K a J x y 
K O Q T y 
K A I E B 
H H H S L 
HUM 
M S S X L 

nn: 

Fijrure 5. A PVDF pnuein «p»i (mm 3n £ ( ul, ;.D rcicrcn - c man »a 

same sample .hen <uh,e» ,,. ammo ac.d 3nal> MS . The N,cnn 1M | se^cnt-eT^M L K R ^r^h™ 11 "* 
a. id Lom r ,K,i„m ul ihr spm. av well a» e«.ma.cd pi and M\V *err mat -Li „ * am,n " 

for ,h,Kc e„,r,e< The ,n P r.,n k ,n P rf^2"J^SS 

larpc >.,.rc d„,cre«.x hemcen the firn and ,ec«nd rankms ^.ri^h J? 1^ Jk a 

.he crrc. pr.,,e.n .denufu a „„„ However, ihe «cuucncc tar iM L^Tftf r L ' Bcc ,n N»*»nj» 

rr.ue.n <cr,ne hydmiyme.rnl.rans.crasc. F L K *' " ,n, "™ J »»* *lcmii> „, lhc 

tryptophan are destroyed during hydrolysis, asparaginc and diamine arc deamidncd 
xu corrss P ond,n ? *Ms. and proline is not quantised in some anah si. ^ s 
The computer programs produce a list of best matching proteins, which arc ranked hv 
a score that .nd.ca.es the match quality. Some programs allow matching U , he 

;;^ C ^;° SPCC '? C ,'uo- d ° W V ° f MV "' Pl ,H0b ° hm - H "" ,h ^ *>d Sander, 
r; til.-i 992. U ,lkm, « a, 1 995 ,. The use of such restrictions im ^ s , hc 
ma.ch.ng. An example of prote.n identification by amino acid composition is shown 
m Fwvr 4. To date, am.no acid composition has been used to .dentilv proteins from 
reference map. atSpintfatma mclli/erum. Mycoplasma gmuuhum. t mti Sac , ha- 
™nc« cewsiac. DicnnMcliam dmoulcum. human sera, human hcari. human 
lymphocy.e. and mouse brain (Corduell a aL 1995: Wasincer <>i ol 199V WilL.ns 
« oL 1995: Juneblu, « „/.. 1991 1994: GarreK « „/.. ,99a! Frey .994 , 
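
The composition matching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the score used here (a sum of squared differences in molar percent) and the names Entry, composition_score and rank_matches are hypothetical and are not taken from any of the cited programs, which use their own scoring schemes. Lower scores indicate closer matches, and the optional pI and MW windows restrict the search as described in the text.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical database record: name, pI, MW and composition (molar %)."""
    name: str
    pi: float
    mw: float
    comp: dict


def composition_score(query, target):
    """Sum of squared differences over the amino acids measured in the query.

    Illustrative score only; a lower value means a closer match.
    """
    return sum((pct - target.get(aa, 0.0)) ** 2 for aa, pct in query.items())


def rank_matches(query, database, pi_window=None, mw_window=None):
    """Return (score, name) pairs sorted best first, optionally windowed."""
    hits = []
    for entry in database:
        if pi_window and not (pi_window[0] <= entry.pi <= pi_window[1]):
            continue  # outside the pI window
        if mw_window and not (mw_window[0] <= entry.mw <= mw_window[1]):
            continue  # outside the MW window
        hits.append((composition_score(query, entry.comp), entry.name))
    return sorted(hits)
```

In such a sketch, a small gap between the first and second scores would signal an ambiguous match, which is the situation where a second identification technique becomes useful.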

PROTEIN IDENTIFICATION BY AMINO ACID COMPOSITION AND N-TERMINAL
SEQUENCE TAG

When samples from 2-D gels are not unambiguously identified by amino acid
composition, pI and MW, the correct identification is often amongst the
top rankings of the list (Hobohm, Houthaeve and Sander, 1994).
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» .ncd Edman degradation and amino acid an.lv ' devete r™f * com- 
PROTEIN IDENTIFICATION BY PEPTIDE MASS FINGERPRINTING

Techniques for the identification of proteins by peptide mass fingerprinting have
recently been described (Henzel et al., 1993; Pappin, Hojrup and Bleasby, 1993;
James et al., 1993; Mann, Hojrup and Roepstorff, 1993).
A major challenge associated with peptide mass fingerprinting is data interpretation.
A number of computer programs are available for matching peptide masses against
databases (reviewed in Cottrell, 1994). Matching is usually undertaken in an iterative
manner, whereby peaks of mass 500-5000 Da are selected and matched under
various search parameters including MW of protein, mass accuracy of peptides and
number of missed enzyme cleavages allowed (Henzel et al., 1993; Mortz et al., 1994;
Rasmussen et al., 1994). The correct protein identity is the protein which has the most
peptide masses in common with the unknown sample. Identities have been established
with as few as three peptides, but unambiguous identification is thought to require a
mass spectrometric map covering most peptides of the protein (Mortz et al., 1994;
Yates et al., 1993). To date, peptide mass fingerprinting of proteins has been
undertaken from the human myocardial protein and keratinocyte maps, from an E. coli
2-D gel, and from reference maps of Spiroplasma melliferum and Mycoplasma
genitalium (Sutton et al., 1995; Rasmussen et al., 1994; Henzel et al., 1993; Cordwell
et al., 1995; Wasinger et al., 1995), although the technique is most powerful when
used in combination with another protein identification technique (Rasmussen et al.,
1994; Cordwell et al., 1995).
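
The rule that the best candidate is the protein sharing the most peptide masses with the sample can be illustrated with a short Python sketch. Everything named here is an assumption made for illustration: count_matches and rank_by_shared_masses are hypothetical helpers, the database is taken to be a plain mapping from protein name to precomputed theoretical peptide masses, and the fixed tolerance in Da stands in for the mass accuracy parameter mentioned above. Missed cleavages and protein MW filtering are not modelled.

```python
def count_matches(observed, theoretical, tol=0.5, lo=500.0, hi=5000.0):
    """Count observed peaks (within lo..hi Da) that match a theoretical mass."""
    peaks = [m for m in observed if lo <= m <= hi]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in peaks)


def rank_by_shared_masses(observed, database, tol=0.5):
    """Rank proteins by the number of peptide masses shared with the sample.

    `database` maps a protein name to its list of theoretical peptide masses.
    Returns (count, name) pairs, highest count first; ties sorted by name.
    """
    scores = [(count_matches(observed, masses, tol), name)
              for name, masses in database.items()]
    return sorted(scores, key=lambda s: (-s[0], s[1]))
```

An iterative search, as described in the text, would call such a function repeatedly while varying the tolerance and the selected peaks.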

MASS SPECTROMETRY SEQUENCE TAGGING 

An extension of peptide mass fingerprinting has recently been described called
peptide sequence tagging (Mann and Wilm, 1994; Mann, 1995). This uses tandem
mass spectrometry (MS/MS) to initially determine the mass of peptides, then subject
them to fragmentation by collision with a gas, and finally determine the mass of
fragments. The resulting spectra give information about a peptide's amino acid
sequence. The fragmentation masses of peptides can rarely be used to assign a complete
sequence, but they usually allow a short 'sequence tag' of 2 or 3 amino acids to be
determined. This sequence tag and the original peptide mass are matched by computer
against a database, providing a likely identity of the peptide and the protein it came from.
The major drawback of this technique as a mass screening tool is the complexity of the
mass data generated and the high level of expertise required for its interpretation.
Nevertheless, it represents a useful new protein identification method which greatly
increases the power of peptide mass fingerprinting protein identification.
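
A minimal Python sketch can show why a short tag adds so much discriminating power to a peptide mass. It is a simplification under stated assumptions: match_tag and peptide_mass are hypothetical helper names, the tag is matched as a plain substring rather than by the flanking fragment masses used in the published method, and the peptide list is invented. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def peptide_mass(seq):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def match_tag(mass, tag, peptides, tol=0.5):
    """Return (protein, peptide) pairs matching both the mass and the tag.

    `peptides` is a list of (protein name, peptide sequence) pairs.
    Simplified: the tag only has to occur somewhere in the sequence.
    """
    return [(prot, seq) for prot, seq in peptides
            if abs(peptide_mass(seq) - mass) <= tol and tag in seq]
```

Two peptides with the same composition but a different residue order have identical masses, so mass alone cannot separate them; the tag can.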

Cross-species protein identification 

Protein sequence databases continue to grow at a rapid rate, yet it is not widely
appreciated that close to 90% of all information contained in current protein databases
comes from only 10 species (A. Bairoch, Pers. Comm.). Fortunately, this information
can be used to study proteomes of organisms that are poorly defined at the molecular
level, by 2-D electrophoresis and 'cross-species' protein identification (Cordwell et
al., 1995; Wasinger et al., 1995). This approach allows proteins from reference maps
of many different species to be identified without the need for the corresponding genes
to be cloned and sequenced. This is particularly true for 'housekeeping' proteins such
as enzymes involved in glycolysis, DNA manipulation and protein manufacture,
which are highly conserved across species boundaries. Proteins that cannot be
identified across species boundaries can then become the focus of further protein
characterisation and DNA sequencing efforts.
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rv compared ]f ,„e same pro.cn ivr* ,v or^r v'ed m l^u ? h " ' ' h k ,eth "'^ 
•Jem,.* of ,he unknown molecule (G.rdwell V, fl , TwT.A. n '^'^""Ni-nce .n ,h,s hemp ,he 
and Ho,„„ra,«r. 1 99, , where ,he ,me Zl L Ji cZL< HZ T. S> ^ ' APPC '- 
Rapid cross-species identification of proteins from 2-D reference maps can be
undertaken with amino acid composition or peptide mass fingerprinting methods
(Figure 6), but these techniques alone may not identify proteins unambiguously when
phylogenetic cross-species distances are great or analysis data is of poor quality (jy w <
et al., 1995; Shaw, 1993; Cordwell et al., 1995). However, very high confidence in
protein identities can be achieved when lists of best-matching proteins generated by
both techniques are compared (Cordwell et al., 1995; Wasinger et al., 1995). The
correct identification is found when the same protein is ranked highly in lists of best
matches generated by both techniques. This method has allowed approximately PO
proteins from the reference map of the mollicute Spiroplasma melliferum, representing
approximately one quarter of the proteome, to be confidently identified by
reference to protein information from other species (S. Cordwell, Personal
Communication). When cross-species protein identification is to be undertaken, it should be
noted that the molecular weight of a protein type across species is usually highly
conserved, but that protein pI can vary by more than 2 units (Cordwell et al., 1995).
Accurate molecular weight determination by direct mass spectrometry of proteins
blotted to PVDF (Eckerskorn et al., 1992) should therefore be a useful additional
parameter for cross-species protein identification.
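
The comparison of best-match lists from the two techniques can be sketched as follows. This is an illustrative Python fragment only: consensus_hits is a hypothetical name, the cut-off top_n and the use of summed rank positions to order the shared candidates are assumptions, and the cited studies may have compared their lists differently.

```python
def consensus_hits(ranked_a, ranked_b, top_n=10):
    """Proteins found in the top_n of both ranked lists (best match first).

    Results are ordered by the sum of their two rank positions, so a protein
    ranked highly by both techniques comes first; ties are sorted by name.
    """
    pos_a = {name: i for i, name in enumerate(ranked_a[:top_n], start=1)}
    pos_b = {name: i for i, name in enumerate(ranked_b[:top_n], start=1)}
    shared = set(pos_a) & set(pos_b)
    return sorted(shared, key=lambda n: (pos_a[n] + pos_b[n], n))
```

Here ranked_a could be the list from amino acid composition matching and ranked_b the list from peptide mass fingerprinting; an empty result would indicate that the two techniques do not agree on any candidate.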

CHARACTERISATION OF POST-TRANSLATIONAL MODIFICATIONS

Many proteins are modified after translation. Such post-translational modifications,
including glycosylation, phosphorylation, and sulfation (see Table 6), are usually
necessary for protein function or stability. Some abnormal modifications are associated
with disease (Duthel and Revol, 1993; Ghosh et al., 1993; Yamashita et al.,
1993). In proteome studies, post-translational modifications can be examined on all
proteins present, or on individual spots. Studies on all proteins provide an indication
of which proteins may carry a certain type of modification. For example, 2-D gel
analysis of cell cultures grown in the presence of [3H]mannose or [32P]phosphate
gives an indication of which proteins carry glycans containing mannose, and which
proteins are phosphorylated (Garrels and Franza, 1989). Lectin binding studies of 2-D
gels blotted to PVDF or nitrocellulose provide information on the saccharides, if any,
that are carried by proteins present (Gravel et al., 1994).

When individual proteins of interest carrying post-translational modifications have
been found, micropreparative 2-D electrophoresis can be used to purify them in
microgram quantities (Hanash et al., 1991; Bjellqvist et al., 1993b). If protein
isoforms of similar MW and pI are to be studied, focusing with narrow range pI
gradients (1 pH unit) can provide greater separation and resolution. After
electrophoresis, the type and degree of protein phosphorylation can be investigated (Murthy
and Iqbal, 1991; Gold et al., 1994), monosaccharide composition can be determined
(Weitzhandler et al., 1993; Packer et al., 1995), and the structure and exact site of
glycoamino acids can be investigated by either Edman degradation based techniques
or by mass spectrometry (Pisano et al., 1993; Huberty et al., 1993; Carr, Huddleston
and Bean, 1993). With further development of rapid techniques, investigation of
phosphorylation and monosaccharides by chromatographic or mass spectrometric
means is likely to become a routine step in the characterisation of post-translational
modifications of proteins from reference maps.
The status of proteome projects

Many technical aspects of proteome research have already been discussed in this
review, but an overview of the status of proteome projects has not yet been presented.
Advances in proteome projects will initially rely on progress in genome sequencing
initiatives, to enable an identity, amino acid sequence, or function to be assigned to
each protein spot. Table 7 shows genome size, proteome size, and the number of
proteins already defined for a number of model organisms. This indicates that whilst
genome sequencing programs for E. coli and S. cerevisiae are advanced, the massive
size of some other genomes (and especially the human genome) means that their
complete nucleotide sequences are unlikely to be available for many years. Because of
this, 2-D reference maps and proteome projects of single cell organisms like
Mycoplasma sp., E. coli and S. cerevisiae will be the most detailed (Cordwell et al., 1995;
Wasinger et al., 1995; VanBogelen et al., 1992; Garrels et al., 1994), and complete
maps of other organisms will take longer to construct. However, the use of
cross-species protein identification techniques will allow proteomes of many prokaryotes
and simple eukaryotes to be partially defined in reference to E. coli and S. cerevisiae.

Table 7: EMimaied eennme mzc. eMimaied proteome n umn #t M r ... . 
The study of vertebrate proteomes and vertebrate development is a phenomenal
undertaking in comparison to the investigation of single cell organisms. This is
because vast numbers of proteins are developmentally expressed, each body tissue has
hundreds of unique proteins, and there are numerous tissue types. However, it is
estimated that at least 35% of proteins in vertebrate cells will be conserved from tissue
to tissue, constituting the 'housekeeping' proteins (Bird, 1995), with the remainder of

electrophoretic conditions are used, reference maps from many tissues of one organism
can be superimposed in gel databases (e.g. Hochstrasser et al., 1992). This
accelerates the definition of the 'housekeeping' proteins, as well as sets of proteins that
are unique to different tissue types. Such studies may, however, be complicated by
post-translational modifications, which can differ on the same gene product in
different tissues. Proteins that remain unknown after identification procedures will be
useful in providing focus for nucleic acid sequencing initiatives.
FUTURE DIRECTIONS OF PROTEOME PROJECTS
This review ha, described recent advance, w ms m of m 
,Ilu,iraiedhou new development; ofolde, «echn,ques,:.Delcctro P hore.s;,, ndam.no 
acid analysis » a, well as ihe applications o; new technobev . nus< spectrometn » 
greatly widened the choice of to* the biologist and protein ctaK^ 
separation, identification and analysis of complex mixtures of proteins Thk h, L 7 
possible the establishment of detailed reference maps for o^ani uh" h 
becoming the method of choice for the definition of L~ o7~ll t d Z 
investigation of gene expression therein. ■ e 

Proteome protects are already impacting on the dosma of molecular W,^ u 
DN A sequence consmutes the definition o. an oreanism For eTamoie *2 n *' * 
of different tissues of a single organism are often ^S^^ KEST 
cross-species idemmcation of proteins (for example the identificat on of n * 
from Co*** u , htC a,n by comparison with 5. cLisia ^^^^ 
organisms that are poorly mo.ecularly defined. As cross-spceies SicauoTc^ 
proceed at . pace orders of magnitude faster than a genome pro/ec i„ ? e ^< of 
defining the gene and protein complement of organfms. the Led tie DN a 
fencing of genomes wi„ be avoided, and emphasis.p.aced on those f ound ^ 

Just as genome sequencing is not an end in itself, neither is an annotated 2-D protein
reference map of an organism, nor indeed the identification of proteins in a proteome.
So whilst an immediate aim of proteome projects is to screen proteins in reference
maps, this will lead to expression studies and characterisation of post-translational
modifications. The challenge that then needs to be addressed is the investigation of
structure and function of proteins in a proteome. The magnitude of this is illustrated by
the fact that over half the open reading frames identified in S. cerevisiae chromosome
III were initially of no known function (Oliver et al., 1992). Such
studies will be an undertaking just as formidable as genome studies are now and
proteome projects are becoming, but will lead to an unimaginably detailed
understanding of how living organisms are constructed and how they operate.
ABSTRACT Analysis of cellular protein patterns by
computer-aided 2-dimensional gel electrophoresis together
with recent advances in protein sequence analysis have
made possible the establishment of comprehensive
2-dimensional gel protein databases that may link protein
and DNA information and that offer a global approach
to the study of the cell. Using the integrated approach
offered by 2-dimensional gel protein databases it
is now possible to reveal phenotype specific protein (or
proteins), to microsequence them, to search for homology
with previously identified proteins, to clone the cDNAs,
to assign partial protein sequence to genes for which the
full DNA sequence and the chromosome location is
known, and to study the regulatory properties and function
of groups of proteins that are coordinately expressed
in a given biological process. Human 2-dimensional gel
protein databases are becoming increasingly important in
view of the concerted effort to map and sequence the entire
genome. Celis, J. E.; Rasmussen, H. H.; Leffers,
H.; Madsen, P.; Honore, B.; Gesser, B.; Dejgaard, K.;
Vandekerckhove, J. Human cellular protein patterns and
their link to genome DNA sequence data: usefulness of
two-dimensional gel electrophoresis and microsequencing.
FASEB J. 5: 2200-2208; 1991.

Key Words: human protein patterns • 2-dimensional gel protein
databases • gene expression • microsequencing • cDNA cloning
• linking protein and DNA information • genome mapping and sequencing

Proteins synthesized from information contained in the
DNA orchestrate most cellular functions. The total number
of proteins synthesized by a typical human cell is unknown,
although current estimates range from 3000 to 6000. Of
these, as many as 70% may perform household functions
and are expected to be shared by all cell types irrespective of
their origin. There are many different cell types in the human
body, with perhaps 30,000 to 50,000 proteins expressed
in the organism as a whole, judged from the fact that about
^ .< of the haploid genome corresponds to genes. Today only
a small fraction of the total set of proteins has been identified
and little is known about the protein patterns of individual
cell types or their variation under physiological and abnormal
conditions.

For the past 15 years, high resolution 2-dimensional gel
electrophoresis has been the technique of choice to determine
the protein composition of a given cell type and for
monitoring changes in gene activity through quantitative
and qualitative analysis of the thousands of proteins that
orchestrate various cellular functions (refs 1-6 and references
therein). The technique, originally described by O'Farrell,
separates proteins in terms of their isoelectric point (pI) and
molecular weight. Usually one chooses a condition of interest
and the cell reveals the global protein behavioral
response, as all detected proteins can be analyzed both
qualitatively and quantitatively in relation to each other. At
present, most available 2-dimensional gel techniques (regular
gel format) can resolve between 1000 and 2000 proteins
from a given mammalian cell type, a number that corresponds
to about 2 million base pairs of coded DNA. Less
abundant proteins can be detected by analyzing partially
purified cellular fractions.

Two-dimensional gel electrophoresis has been widely applied
to analysis of cellular protein patterns from bacteria to mammalian
cells (refs 1-6, and references therein). In spite of
much work, however, information gathered from these
studies has not reached the scientific community in its fullness
because of lack of standardized gel systems and the lack
of means for storing and communicating protein information.
Only recently, because of the development of appropriate
computer software (7-13), has it been possible to scan
gels, assign numbers to individual proteins, and store the
wealth of information in quantitative and qualitative comprehensive
2-dimensional gel protein databases (4, 14-23),
i.e., those containing information about the various properties
(physical, chemical, biological, biochemical, physiological,
genetic, immunological, architectural, etc.) of all the
proteins that can be detected in a given cell type. Such integrated
2-dimensional gel protein databases offer an easy
and standardized medium in which to store and communicate
protein information and provide a unique framework in
which to focus a multidisciplinary approach to study the cell.
Once a protein is identified in the database, all of the information
accumulated can be easily retrieved and made available
to the researcher. In the long run, protein databases are
expected to foster a wide variety of biological information
that may be instrumental to researchers working in many
areas of biology, among others, cancer and oncogene
studies, differentiation, development, drug development and
testing, genetic variation, and diagnosis of genetic and clinical
diseases (Fig. 1).

The approach using systematic 2-dimensional gel protein 
analysis has recently gained a new dimension with the ad- 
vent of techniques to microsequence major proteins recorded 
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"in, labeled proteins from normal human MRC-5 fib obfas Th f , S > n,hf,,t .' ma !; e °' 3 lr /™ °' »" >EF Hunr^ran, „l'|»SJmc.hio. 
bar, and SV40 transformed MRC-5 ,righ ba ffiorobla «s ? Pbltn r'h"" 5 * ^ pr " ,C,ns in MRC ^ 

* The funcion peruse annotation for spot a lows The ooerf o^n^ T' ,nlorma " on undcr lhl ' S>yrolv,ic pnthwa, 

(7, Relative abundance of cvtoskeletal and I ZZ a X "^T ^ ,nforma " on a ™'-'''<- ■"' liven p„„ ci „. 
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cross-matched experiments (18, 22). 

Once a standard map of a given protein sample is made,
one can enter qualitative annotations to make a reference
database. Our master 2-dimensional gel database of transformed
human amnion cell (AMA) proteins (20) lists 3430
polypeptides, of which 2592 correspond to cellular components
having pI's ranging from 4 to 13 and molecular
weights between 8.5 and 230 kDa. The most abundant proteins
in the database correspond to total actin (3.87% of total
protein; about 90 million molecules per cell) while the
least abundant of the recorded polypeptides are present in
the vicinity of 5000 molecules per cell. Some annotation
categories we are using to establish the master AMA database
include: 1) protein identification (comigration with
purified proteins, 2-dimensional immunoblotting, microsequencing);
2) amounts (total amounts and levels of synthesis);
3) subcellular localization (nuclear, cytoskeletal, membrane,
membrane receptors, specific organelles, etc.); 4)
antibodies; 5) post-translational modifications (phosphorylation,
glycosylation, methylation, etc.); 6) microsequencing; 7)
cell cycle specificity (specific variations in levels of synthesis
and amount); 8) regulatory behavior (effect of hormones,
growth factors, heat shock, etc.); 9) rate of synthesis in normal
and transformed cells (proliferation sensitive proteins,
cell cycle specific proteins, oncogenes, components of the
pathway (or pathways) that control cell proliferation); 10)
function (mainly from comigration with proteins of known
function); 11) sets of proteins that are coordinately regulated
(hierarchy of controls, differential gene expression in various
cell types); 12) cDNAs (cloned cDNAs); 13) proteins that are
affected by disease (systematic comparison of protein
patterns of fibroblast proteins from healthy and diseased
individuals); 14) expression and exploitation of transfected
cDNAs; 15) pathways (metabolic, others); 16) gene localization
(genetic and physical); 17) effect of microinjected antibody
on patterns of protein synthesis; and 18) secreted proteins.

cai n ^ man0n K emCr f d f ° r any SP ° 1 in 3 ^notation 
di UP ,u™ ? CaS,ly retnCVed ^ askin S the computer to 
d splay t he information on the color screen. For example 

4 MA j u 2 ^ nthetlc ima S e of a NEPHGE gel (master 
AMA database displaying the information contained under 
the entry glycolytic pathway. Alternatively, one can use the 
function peruse annotations for spot to directly ask the com- 
puter to | Ist aH the emries avaj]able for a ? 

n fetal human tissues) „ ,s possible to take a quick look a, 
the information in that particular entry (Fig 2F) 

2 £ZT ° b ^ C encoumcred in Gilding comprehensive 
2 dimensional gel protein databases is identifving the large 

5auba"es° 20 r ° 2 n n t by ^logy. In our 

Trnll ( ' )' L knOW " proteins are identified by one or 
a combination of the following procedures: 7) comigration 
*. h known proteins 2) 2-dimensional gel immunoblotting 

C^LJTf C » am, B b , 0dieS ' 2nd 3) m i^«quencing of 
Coomassie Brillant Blue stained human proteins recovered 
from dned 2-d.mensional gels (see next section) Protein 
t'Z ty m " nS ° f m ^osequencingmay oe difficult 

a ind.v.dual protein members of families with short pepn7e' 
differences may escape detection. In the gene-proteiE data' 

£S£f avl^ 2 (H * 23)> an ° ther ^Ztl . 
database available at present, proteins are being identified bv 
a w.d er range of tests that include comigration with purified 
protems; genetlc criterjon (ddet . o £ ^Sjf 

nonsense, m.ssense, regulatory), plasmid-bearing strains 
tion ,n n . V,tr ° K S y n ! hesis of protein; selective labeling (methyl 
uon phosphorylation); peptide map similarity; and physio- 
logical criterion and selective derivatization 



So far we have received nearlv 550 antibodies from 
tones all over the world and these are beZl 'Z 
tested by dimensional gel immunoblotting fc * n ™^ 

TT- Similar, >'- purified Proteins and o Sell: 
provided by several laboratories hav? greatlv aided idfn "n 
uon of unknovvn 0leins (2fJT We routindv £ 

attL a e n jl P h r0te T Samp,eS Pr ° misC thf d0n °" «* ™« 
avaUab e all the information we mav have accumulated on tha; 

pamcular protein. For example. Table 1 lists entries availa- 
ble for Lipocortin \ (IEF SSP 8216). also known as annexfn 
v. VAU-o_ endonexm II. renoconin. chromobindin-5' an- 

srsargf PAP - r rcajc,med,n - ibc - ****** 

As mentioned previously, one distinct advantage of
2-dimensional gel electrophoresis is the possibility of studying
quantitative variations in cellular protein patterns that
may lead to the identification of groups of proteins that are
expressed coordinately during a given biological process.
Quantitation, however, is not an easy task, as reflected by the
lack of published data on global cellular protein patterns. We
believe this is partly due to difficulties in obtaining sets of
gels that are suitable for computer analysis (streaking,
material remaining at the origin, etc.) as well as to limitations
(laborious editing time, need of calibration strips to
merge images, limited dynamic range, etc.) in the computer
analysis systems available at the moment. Perhaps the most
advanced quantitative studies published so far using computer
analysis have been carried out by Garrels and co-workers
(18, 22). In particular, these investigators have established
a quantitative rat protein database (18, 22) designed
to study growth control (proliferation, growth inhibitors, and
stimulation) and transformation in well-defined groups of
cell lines obtained by transformation of rat REF52 cells with
SV40, adenovirus, and the Kirsten murine sarcoma virus.
These studies have revealed clusters of proteins induced or
repressed during growth to confluence as well as groups of
transformation-sensitive proteins that respond in a differential
fashion to transformation by DNA and RNA viruses. A
most interesting feature of this quantitative database is the
discovery of a group of coregulated proteins that show similar
expression patterns as the cell cycle-regulated DNA replication
protein (PCNA/cyclin).

In our human databases, most quantitations have been
done by estimating the radioactivity contained in the
polypeptides by direct counting of the gel pieces in a scintillation
counter (20, 21). Up to 700 proteins can be cut out
of a gel in a period of time comparable
to that required for editing a synthetic image.
Quantitation of this large number of spots is difficult
without the assistance of a master reference image and a
numbering system that can be used to identify the spots. Using
this approach we have recorded quantitative changes in
the relative abundance of 592 [35S]methionine-labeled proteins
synthesized by quiescent, proliferating, and SV40
transformed human embryonic lung MRC-5 fibroblasts (21).
Some data concerning cytoskeletal and cytoskeletal-related
proteins are presented in Fig. 2G. Our studies as well as
those of Garrels and co-workers (18, 22) may in the long run
help define patterns of gene expression that are characteristic
of the transformed state.

OTHER 2-DIMENSIONAL GEL PROTEIN DATABASES



As mentioned previously there are other 2-dimensional gel
databases available in computer form that have been pub-
phocytes, leukocytes, leukemic cells), mouse (NIH/3T3 cells,
T lymphocytes), Aplysia, yeast (Saccharomyces cerevisiae), plants
(wheat, barley, sorghum), and Euglena. Databases of tissue
proteins (brain, whole mouse, liver) and body fluid proteins
(plasma proteins, cerebrospinal fluid, urine, and milk) are
being established in several laboratories. The reader is
directed to the review by Celis et al. (4) for details and references
concerning these databases.



MICROSEQUENCING HAS ADDED A NEW 
DIMENSION TO COMPREHENSIVE 
2-DIMENSIONAL GEL DATABASES: A DIRECT 
LINK BETWEEN PROTEINS AND GENES 

The development of highly sensitive amino acid gas-phase or
liquid-phase sequenators (24), together with the establishment
of efficient protein and peptide sample preparation
methods, has opened the possibility to perform a systematic
sequence analysis of proteins resolved by 2-dimensional gel
electrophoresis. Indeed, generated pieces of protein sequences
can be used to search for protein identity (comparison
with available sequences stored in databanks) as well as
for preparing specific DNA probes for cloning of as yet
uncharacterized proteins (Fig. 1). In addition, partial protein
sequences can be stored in 2-dimensional gel databases (for
example, see Fig. 2H) and offer a unique link between proteins
and genes (Fig. 1).

In the early 1970s gel electrophoresis was used to purify
proteins for sequencing purposes (reviewed by Weber and
Osborn in ref 25). Proteins were recovered by diffusion and
sequenced by the manual dansyl-Edman degradation at the
nanomole level. This technique was further refined by using
electro-elution to recover proteins and by miniaturizing the
system (26). This method has been used extensively, but
showed increasing drawbacks (low yields, protein samples
contaminated by free amino acids, and NH2-terminal blocking)
as the amounts of handled protein gradually became
smaller (e.g., at the 10 picomol level).

Most of the problems referred to above have been
minimized with the introduction of protein-electroblotting
procedures (27-32). When proteins are blotted on chemically
inert membranes, it is possible to sequence the immobilized
proteins directly without additional manipulations.
Thus, depending on the amount of bound protein and its nature,
this direct sequencing procedure generally yields
NH2-terminal sequences containing 10-40 residues. As such, this
technique was used to identify, by their NH2-terminal sequences,
differentially expressed major proteins from total
cellular extracts separated on 2-dimensional gels. A major
difficulty encountered in this procedure is the occurrence of
frequent artefactual blockage of the proteins. Several studies
suggest that this phenomenon is mainly due to reaction with
contaminants (particularly unpolymerized acrylamide
present in the gel) and to a high dilution of the protein (low
concentration of the protein per unit membrane surface). In
addition to this primarily technical problem, many proteins
are blocked in vivo by acylation or by a pyrrolidone carboxylic
acid cap.

The problem of partial or complete NH2-terminal blockage
can be circumvented by generating internal amino acid
sequences. This is achieved by fragmenting the protein
present in the gel (gel in situ cleavage) or by cleaving it while
bound to the membrane (membrane in situ cleavage)
(33-35). In both cases, proteins are either cleaved in a restricted
way (e.g., by limited enzymatic digestion or by using
restrictive chemical cleavage conditions) or fragmented into
smaller peptides.
Of the different combinations examined, we have obtained the best
results by using exhaustive proteolytic digestion of
membrane-immobilized proteins. This method has been
described for Ponceau red-stained proteins on nitrocellulose
blots (34), for Amido black-stained Immobilon-bound proteins,
and for fluorescamine-detected proteins on glass fiber
membranes (35). The proteases used (trypsin, chymotrypsin,
or pepsin) cleave at multiple sites, generating small peptides
that elute from the blot into the digestion buffer, from which
they are purified by reversed-phase high performance liquid
chromatography (HPLC) before being sequenced individually.
Although each of these manipulations could be expected
to result in a reduced yield of final sequence information, we
were surprised that the peptides could be sequenced with
high efficiency. In our hands, this approach could be routinely
applied to gel-purified proteins available in amounts
ranging from 5 to 10 µg, and often yielded sequence information
covering more than 30% of the total protein. As
membrane-immobilized proteins are not homogeneously
digested, but rather show protease sensitivity next to resistant
regions, the number of peptides generated is much lower
than expected from the number of potential cleavage sites.
Consequently, HPLC peptide chromatograms are less complex
and most peptides can be recovered in pure form.

As only limited amounts of a protein mixture can be loaded on a 
2-dimensional gel, proteins of interest are often obtained in 
yields insufficient for the currently available sequencing 
technology. More material can be obtained by enriching for a 
certain subcellular fraction (purified cell organelles) or by 
exploiting affinity (dyes, metals, drugs, etc.) or hydrophobic 
properties of proteins before gel analysis. All of the sequencing 
results accumulated so far in the human protein database (20) (a 
few are shown in Fig. 2H) have been obtained from analysis of 
protein spots collected from 2-dimensional gels that had been 
stained with Coomassie blue according to standard procedures and 
dried for storage. Proteins are recovered from the collected gel 
pieces by a protein-elution-concentration device, combined with gel 
electrophoresis and electroblotting. Details of this technique 
have been reported in a previous communication (42) and a 
brief outline is given below. 

Combined gel pieces are allowed to swell in gel sample 
buffer (a total volume of 1.5 ml). The gel pieces combined 
with the supernatant are then collected into a large slot made 
in a new gel. The slot is further filled with Sephadex G-10 
equilibrated in gel sample buffer. During consecutive gel 
electrophoresis, most of the electrical current passes on the 
side of the slot instead of passing through the slot. This 
results in both a vertical stacking and horizontal contraction 
of the protein band. With this device the protein is efficiently 
eluted from the gel pieces and concentrated from a large 
volume into a narrow spot. The highly concentrated (about 
5 mm²) protein spot is then electroblotted on PVDF 
membranes, stained with Amido black, and in situ digested 
with trypsin. The peptides generated during digestion elute 
from the membrane into the supernatant, and can be sepa- 
rated by narrow bore reversed-phase HPLC and collected in- 
dividually for sequence analysis. 

Using this and previous procedures (37, 39, 42), we have 
so far analyzed 70 protein spots collected from 
2-dimensional gels (20, and unpublished observations) (see 
for example Fig. 2/7). The sequence information amounts to 
2100 allocated residues corresponding to an average of 30 
residues per protein spot. So far we have made cDNAs of 
many of the unknown proteins that have been microsequenced, 
and a substantial number have been cloned and sequenced. 
All available information indicates that it may be 
possible to obtain partial sequence information from most of 

Nonenzymatic extraction of cells from clinical tumor 
material for analysis of gene expression by two-dimensional 
polyacrylamide gel electrophoresis 

[...] of preparation of tumor cells, inc[...] [...] freezing, have 
advantages over methods [...]. Nonenzymatic methods are rapid, 
appear to reduce [...] Percoll gradient centrifugation. Using these 
techniques, high-quality 2-DE [...] derived from tumors of the lung 
and breast [...] heat shock proteins, non-muscle tropomyosins [...] 
filament [...] identified. We conclude that [...] cells from fresh 
tumor tissue improves [...] possibilities that these techniques may 
be useful in clinical diagnosis. 



1 Introduction 



Tumors may develop by a number of different mechanisms in any 
given cell type. At the time of diagnosis, tumors will have 
progressed along different pathways to various stages of 
malignancy. To provide a basis for individual therapy it is of 
importance to examine specific properties of the tumor cell 
population in each patient. A large number of different markers 
have been described in order to increase the diagnostic accuracy. 
It is likely that a combination of several markers is needed in 
the future in order to reflect different properties of the tumor. 
One important method for the resolution of a large number of 
potential markers is two-dimensional gel electrophoresis (2-DE). 
Extensive efforts are made in identifying various polypeptides 
separated by 2-DE [...] how the expression of these polypeptides 
[...] the response to cellular transformation and various culture 
conditions [1, 2]. It would be of [...] this information to 2-DE 
separations of polypeptides from tumor tissue samples. However, one 
prerequisite is that the quality of the 2-DE gels from tumor 
samples is comparable with that of 2-DE gels from samples of 
cultured cells. 

Tumor tissues are commonly used for various bio[...] 
polyacrylamide gel electrophoresis (PAGE). [...] patterns are 
obscured by contamination of serum [...] connective tissue 
proteins. Such non-tumor-cell-related variations represent serious 
problems in the interpretation and inter-patient comparison of 2-DE 

Correspondence: Dr. Bo Franzén, Division of Tumor Pathology, [...] 
Karolinska Hospital, [...] 

Abbreviations: 2-DE, two-dimensional polyacrylamide gel 
electrophoresis; IEF, isoelectric focusing; LDH, lactate 
dehydrogenase; [...]; PBS, phosphate buffered saline; PCNA, [...]; 
[...] fluoride; SDS, sodium dodecyl sulfate; [...] 



[...] 2-DE patterns of cells prepared from fresh [...] tumors 
[...] analyzed after enzymatic [...] or after culturing tumor 
fragments in medium containing radioactive amino acids [6]. These 
procedures may, however, lead to alterations in the gene 
expression/polypeptide patterns. We are only aware of one study 
where nonenzymatic extraction of cells from fresh tumor tissue 
(prostate cancer) was used to prepare [...] 2-D PAGE [4]. We have 
examined [...] extraction and various nonenzymatic preparation 
techniques, including fine needle aspiration, for the preparation 
of cells from fresh tumor tissues. We describe nonenzymatic 
extraction procedures that are rapid, lead to high-quality 2-DE 
patterns, and that alleviate the necessity to purify tumor cell 
populations from dead [...] 

2 Materials and methods 

2.1 Cell cultures and samples used for spot 
identification 

A rat embryonal fibroblast cell line, WT2 (a kind gift from 
Dr. J. I. Garrels and Dr. S. Pattersson), was used for the 
identification of a number of heat shock and struc[...] 
[...] epithelial breast carcinoma cells, MDA-231 and MCF-7, were 
purchased from ATCC and grown as recommended. Polypeptides 
prepared from leukemia type pre-B-ALL were separated by 2-DE. The 
2-DE map was then analyzed by Dr. S. M. Hanash (University of 
Michigan, Ann Arbor, USA). 

2.2 Tumor tissue samples 

In this study, 2-DE maps from seven tumors were used [...] 
illustrations: two adenocarcinomas of [...] LB, mucinous, both 
cases intermediate [...] of differentiation), one squamous 
carcinoma of the lung (LS), one carcinoid-like breast cancer (BC), 
[...] adenoma (highly differentiated) of the thyroid (TA), one 
highly differentiated hyperneph- 

[...]fully mixed for 2.5 h and centrifuged for 15 min at 
10000 rpm to remove any insoluble material [...] 

cells (Fig. 2d). Polypeptides were identified through a 
laboratory exchange of cell samples/2-DE maps and 
through 2-DE analysis of purified proteins (Table 1). 

3.2 Preparation of samples from solid tumors 
3.2.1 Fresh versus frozen tissue 

An adenocarcinoma of the lung (LA) was prepared for 2-DE by 
conventional methods using frozen material (Fig. 3a). There are 
several possibilities for the poor resolution using frozen tissue, 
including the presence of high molecular weight protein 
aggregates. Filtering extracts through 0.1 µm filters (Durapore, 
Millipore) resulted in a slightly improved resolution (not shown). 
When fresh tumor tissue from tumor LA was used for sample 
preparation, using fine needle aspiration to collect the cells, 
the resolution was considerably improved (Fig. 3b). The use of 
fresh tissue resulted in a general increase in resolution, which 
was most pronounced in the 50-100 kDa molecular mass range. A 
number of differences in the protein profiles of the gels in 
Figs. 3a and 3b can be observed, some of which are indicated in 
the figures. The decrease in serum albumin in Fig. 3b is likely to 
result from loss of serum proteins occurring when cells were 
pelleted after aspiration. Other differences, such as the 
decreased level of transformation-sensitive tropomyosins 
(TM1-TM3), may result from enrichment of tumor cells in the sample 
of Fig. 3b. Fine needle aspiration, a well-established technique 
in cytology, extracts mainly tumor cells because of decreased 
intercellular adhesiveness of neoplastic cells as compared to 
normal tissue. Microscopic examination of Diff-Quick-stained 
extracted cells from case LA revealed almost 100% tumor cells, 
whereas the whole tissue extract contained approximately 60% 
tumor cells. 

Table 1. Names and abbreviations for identified proteins 

Spot      Name 

A         Actins 
aA        alpha-Actinin 
B23       Protein B23/Numatrin 
EF2       Elongation factor 2 
EF1       Elongation factor 1 6 
GT        Glutathione-S-transferase (pi) 
hsp60     Heat shock protein 60 
hsp73     Heat shock protein 73 
hsp80     Heat shock protein 80, GRP78, BiP 
hsp90     Heat shock protein 90 
hsp100    Heat shock protein 100, Endoplasmin 
IFa       Intermediary filament associated 
k8        Cytokeratin 8 
LamB      Lamin B 
Lip1      Lipocortin I 
Lip2      Lipocortin II 
          Lipocortin V 
          Mitcon 1/6 - F1 ATPase 
          Mitcon 2 
Mit3      Mitcon 3 
MRP       Mucine Related Polypeptides 
pcna      Proliferating cell nuclear antigen 
PLC       Phospholipase C (1) 
          RO/SS-A antigen 
          Serum Albumin 
          alpha-Tubulin 
          beta-Tubulin 
          Non-muscle tropomyosin isoform 1 
          Non-muscle tropomyosin isoform 2 
          Non-muscle tropomyosin isoform 3 
          Non-muscle tropomyosin isoform 4 
          Non-muscle tropomyosin isoform 5 
          Triose phosphate isomerase 
          Vimentin 
          Vimentin derived protein 
          Vimentin derived protein 
          Vimentin derived protein 
Vid4      Vimentin derived protein 
          Vinculin 

difference in intensity were lower than when a nonenzymatic 
preparation was compared with an enzymatic preparation. 

2-DE maps of satisfactory quality were prepared by a third 
procedure. Cells were released from small pieces of tumor by 
squeezing (see Section 2). Some examples of this are shown in 
Fig. 6, where 2-DE maps derived from a case of hypernephroma, KH 
(Fig. 6a), a case of thyroid tumor, TA (Fig. 6b), and a case of 
corpus cancer, CP (Fig. 6c), can be seen. We conclude that 
nonenzymatic techniques are useful for 2-DE analysis of a number 
of different tumors. The quality of the resulting gels is 
comparable to that obtained with cultured cells [...]. [...] 
these methods will be optimal will, in our experience, depend on 
the tumor material. For example, very small tumors are preferably 
extracted by squeezing; on the other hand, breast cancers (which 
are often fibrous) yield satisfactory samples using scraping. 

3.2.3 Purification of cells on Percoll gradients 

We considered the possible advantage of separating 
viable cells from dead cells, erythrocytes, and debris 
using discontinuous Percoll gradients. Cells collected 

with these observations (Fig. 8), a number of potential 
and interesting markers, like tropomyosin isoforms, 
cytokeratins and heat shock proteins, appear to be insensitive 
to loss of viability during the preparation procedure. 
We have to date made numerous observations of alterations 
in the expression of these polypeptides in breast 
cancers and lung cancers. 

Another problem that may occur, irrespective of the sample 
preparation technique used, is admixture of lymphocytes. These 
cases are easily detectable in smears and it may therefore be 
possible to select lymphocyte-specific spots as "internal markers" 
for the 2-D PAGE analysis. Studies using this approach are in 
progress. Many of the polypeptides identified are structural 
(Table 1). Since the expression of many of these polypeptides is 
known to vary between normal and malignant cells, the possibility 
to determine their expression simultaneously is appealing. In the 
specific case of breast cancer, alterations in the expression of 
intermediate filament proteins (cytokeratins) are known to occur 
during tumor progression [23]. Other proteins known to be 
differentially expressed between normal cells and transformed 
cells are tropomyosins, numatrin/B23, heat shock proteins and 
PCNA. To this end, we have observed alterations in the expression 
of cytokeratin 8, hsp90, and non-muscle tropomyosin isoform 2 
during malignant progression (Okuzawa et al., in preparation and 
Franzén et al., in preparation). 

The method of choice for sample preparation from tumor tissues 
will depend on the properties of the tumor material studied. It 
may be important to use only one method when comparing cases 
within one group, as differences were observed between methods. 
The advantages of the nonenzymatic techniques are (i) that they 
minimize contamination with connective tissue, (ii) that problems 
with contamination of serum proteins are avoided, and (iii) that 
separation of viable and dead cells is not necessary. Hereby the 
resolving power of 2-D PAGE is maximized for the analysis of human 
tumors and studies on inter-tumor variations in gene expression 
are facilitated. In addition, the polypeptide patterns obtained 
may be more representative for the in vivo tumor cell since the 
use of enzymes and incubations has been minimized. 

Reference points for comparisons of two-dimensional 
maps of proteins from different human cell types 
defined in a pH scale where isoelectric points correlate 
with polypeptide compositions 



A highly reproducible, commercial and nonlinear, wide-range immobilized pH 
gradient (IPG) was used to generate two-dimensional (2-D) gel maps of 
[35S]methionine-labeled proteins from noncultured, unfractionated normal 
human epidermal keratinocytes. Forty-one proteins, common to most human 
cell types and recorded in the human keratinocyte 2-D gel protein database, 
were identified in the 2-D gel maps and their isoelectric points (pI) were 
determined using narrow-range IPGs. The latter established a pH scale that 
allowed comparisons between 2-D gel maps generated either with other IPGs 
in the first dimension or with different human protein samples. Of the 41 
proteins identified, a subset of 18 was defined as suitable to evaluate the 
correlation between calculated and experimental pI values for polypeptides 
with known composition. The variance calculated for the discrepancies 
between calculated and experimental pI values for these proteins was 0.001 
pH units. Comparison of the values by the t-test for dependent samples 
(paired test) gave a p-level of 0.49, indicating that there is no significant 
difference between the calculated and experimental pI values. The precision 
of the calculated values depended on the buffer capacity of the proteins, 
and on average it improved with increased buffer capacity. As shown here, 
the widely available information on protein sequences cannot, a priori, be 
assumed to be sufficient for calculating pI values because 
post-translational modifications, in particular N-terminal blockage, pose a 
major problem. Of the 36 proteins analyzed in this study, 18-20 were found 
to be N-terminally blocked and of these only 6 were indicated as such in 
databases. The probability of N-terminal blockage depended on the nature of 
the N-terminal group. Twenty-six of the proteins had either M, S or A as 
N-terminal amino acids and of these 17-19 were blocked. Only 1 in 10 
proteins containing other N-terminal groups were blocked. 
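
The blockage counts quoted in the abstract can be cross-checked and turned into proportions with a few lines of Python. This is only an illustrative sketch: the variable and function names are assumptions introduced here, and the counts are copied from the abstract rather than from the underlying data.

```python
def fraction_range(low, high, total):
    """Return the (low, high) counts as fractions of total."""
    return low / total, high / total

# Counts taken from the abstract.
analyzed = 36                 # proteins analyzed
blocked_all = (18, 20)        # N-terminally blocked among all 36
annotated_in_db = 6           # flagged as blocked in databases
msa_total = 26                # N-terminal residue is M, S or A
blocked_msa = (17, 19)        # blocked among those 26
other_total = analyzed - msa_total   # proteins with other N-terminal residues
blocked_other = 1             # "1 in 10"

overall = fraction_range(*blocked_all, analyzed)
msa = fraction_range(*blocked_msa, msa_total)
other = blocked_other / other_total

print(f"blocked overall:      {overall[0]:.0%} to {overall[1]:.0%}")
print(f"blocked with M/S/A:   {msa[0]:.0%} to {msa[1]:.0%}")
print(f"blocked with others:  {other:.0%}")
```

The two subgroups add up to the overall range (17 + 1 = 18 and 19 + 1 = 20), so the figures in the abstract are internally consistent.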



1 Introduction 

As compared with carrier ampholyte isoelectric focusing (CA-IEF), 
the application of immobilized pH gradients (IPGs) in the first 
dimension in 2-D gel electrophoresis offers improved 
reproducibility [1] because the nature of the pH gradient makes 
the resulting focusing positions insensitive to the focusing time 
[2] and to the type of sample applied [3]. The recently introduced 
ready-made IPG strips [4] seem to be an ideal substitute for the 
carrier ampholyte gradients, which until now have been the most 
commonly used first dimensions in 2-D gel electrophoresis. The 
availability of standardized first dimensions opens the 
possibility of comparing 2-D gel maps of various cell types 
generated in different laboratories, provided that the focusing 
positions of a number of easily recognizable polypeptide spots 
common to the cell types in question are known. Even though this 
approach is limited to experiments performed with the same 
standardized IPG, the flexibility provided by IPGs allows the pH 
gradient to be adjusted to the requirements of a particular 
experiment. 

Exchange and communication of 2-D gel protein data requires a pH 
scale that is independent of the particular IPG used and by which 
the results can be described. The introduction of carbamylation 
trains and the relation of focusing positions to the spots in 
these trains represented a step forward towards solving the 
reproducibility problem experienced with carrier ampholyte 
focusing [5]. Problems associated with the use of carbamylation 
trains were mainly due to lack of temperature control and to the 
use of nonequilibrium focusing conditions. Accordingly, the 
pattern variation involved not only the resulting pH gradients, 
but also the relative spot positions as related to each other and 
to spots in the carbamylation trains. Even though the question of 
reproducibility has, to a large extent, been solved, the 
carbamylation trains are still not ideal as markers because the 
spots in the trains do not represent defined entities but rather a 
large number of differently carbamylated peptides having close pI 
values. As a result, the spots are large and poorly defined as 
compared to the ordinary polypeptide spots in 2-D gel maps. 

Neidhardt [6] defined the pH gradient in 2-D gel experiments by pI 
markers whose pI values were calculated from the amino acid 
composition. Focusing positions of other polypeptides could be 
predicted from their composition but the pK values needed for the 
pI calculations were unknown. Various groups employing this 
approach do not use the same pK values [6, 7] and therefore, the 
pI values derived in this way cannot be expected to describe the 
variation of the hydrogen ion activity. In spite of this fact, it 
is still possible to make approximate predictions of focusing 
positions because the pK values used to define the pH gradient are 
also used to calculate pI values and to predict the focusing 
positions. Errors in pK assignments are therefore compensated. A 
pH scale which correctly reflects the variation in hydrogen ion 
activity during focusing should improve the precision of the 
predictions, but this has never been implemented with CA-IEF 
focusing as a first dimension in 2-D gel electrophoresis. The main 
reason for this is the problems associated with pH measurements in 
focused gels containing high concentrations of urea. 
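
A calculation of pI from amino acid composition, as discussed above, can be sketched in Python as a search for the pH at which the Henderson-Hasselbalch net charge is zero. This is a minimal sketch under stated assumptions: the pK values below are rough, illustrative numbers chosen for the example and are NOT the values used in this study, and the function names and the composition dictionary format are hypothetical.

```python
# Illustrative pK values only (assumed for this sketch, NOT the paper's values).
PK_POSITIVE = {"Nterm": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
PK_NEGATIVE = {"Cterm": 3.5, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}

def net_charge(counts, ph):
    """Net charge of a polypeptide at a given pH.

    counts maps group names ("Nterm", "Cterm", one-letter residue codes)
    to the number of such ionizable groups in the polypeptide.
    """
    charge = 0.0
    for group, pk in PK_POSITIVE.items():
        # Fraction of a basic group that is protonated (charge +1).
        charge += counts.get(group, 0) / (1.0 + 10 ** (ph - pk))
    for group, pk in PK_NEGATIVE.items():
        # Fraction of an acidic group that is deprotonated (charge -1).
        charge -= counts.get(group, 0) / (1.0 + 10 ** (pk - ph))
    return charge

def isoelectric_point(counts, lo=0.0, hi=14.0, tol=1e-4):
    """Find the pH of zero net charge by bisection.

    Net charge decreases monotonically with pH, so bisection converges.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(counts, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical composition: free versus blocked N-terminus.
free = {"Nterm": 1, "Cterm": 1, "D": 3, "K": 3}
blocked = dict(free, Nterm=0)
print(round(isoelectric_point(free), 2), round(isoelectric_point(blocked), 2))
```

Removing the N-terminal amino group from the composition lowers the computed pI, which is why unannotated N-terminal blockage matters for predictions of this kind.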

IPGs can be described from the concentration variation of the 
immobilized groups, provided that the pK values of these groups 
are known for the conditions prevailing during focusing. To avoid 
measurements on gels, Gianazza et al. [8] suggested the use of pK 
values derived by addition of determined pK shifts. Recently, 
direct determinations of pK differences between immobilized groups 
in IPGs were made by determining pI-pK values in overlapping 
narrow-range IPGs [9, 10] and the results verified the 
applicability of the Gianazza approach. A description of the 
focusing results in a pH scale, which correctly describes the 
variation of the hydrogen ion activity for the focusing conditions 
used, not only allows the comparison of 2-D gel maps generated 
with different IPGs, but also opens the possibility for 
correlating the focusing position of a polypeptide with its 
composition [9]. Experiments by Bjellqvist et al. [9, 10] have 
implied that pH scales showing good correlation between calculated 
and experimental pI values can be derived for any of the 
conditions commonly used for focusing in connection with 2-D gel 
electrophoresis. These pH scales are then defined through the pK 
values of the immobilized groups in the IPG-containing gel. To be 
useful for interlaboratory comparisons, however, the pH scale has 
to be defined through pI values of easily recognizable spots 
present in the 2-D gel map. So far, pI determinations in a useful 
pH scale, combined with determinations of pK values needed for pI 
calculations, have only been made for the pH range 4.5-6.5 at 
10 °C [9]. CA-IEF focusing as described by O'Farrell [11] does not 
control the temperature of the first dimension, which can be 
expected to be slightly above room temperature. With IPGs, the 
temperature commonly used is about 20 °C [4, 12] or 25 °C [13] 
and this is a critical parameter that needs to be controlled [14]. 

The present work was designed to compare 2-D gel maps of different 
cell types in a laboratory applying both CA-IEF and IPG focusing 
at a common temperature. To this end we have generated 2-D gel 
maps of proteins from noncultured, unfractionated normal human 
epidermal keratinocytes with IPG in the first dimension and a 
focusing temperature of 25 °C. We have used commercial nonlinear, 
wide-range IPG strips which give 2-D gel maps that are closely 
similar to the ones resulting with the CA-IEF technique used to 
establish the human keratinocyte database [15]. As an initial step 
towards interlaboratory comparisons of results obtained with the 
nonlinear gradient as a first dimension we report here on the 
focusing positions of 41 known proteins that are common to most 
human cell types. The pH range covered corresponds to the range in 
classical CA-IEF 2-D gel electrophoresis and, in order to use 
these proteins as internal standards for comparing 2-D gel maps 
generated with other IPGs, we determined their pI values with 
narrow-range IPGs in the first dimension. We have compared the 
calculated versus experimental pI values and show that it is 
necessary to have further information (absence or presence and 
nature of post-translational modifications), in addition to amino 
acid composition, to be able to calculate pI values that 
correspond to the actual experimental values. The pK values used 
for the calculations are provided and the usefulness of pI 
prediction in relation to database information is discussed. 
Furthermore, we comment on the possibility of using 
experimentally determined pI values to verify the available 
database information on polypeptide composition. 



2 Materials and methods 

2.1 Apparatus and chemicals 

Equipment for isoelectric focusing and horizontal SDS 
electrophoresis (Multiphor II electrophoresis chamber, 
Immobiline strip tray, Multidrive XL programmable power supply, 
Macrodrive power supply and Multitemp II) was from Pharmacia LKB 
Biotechnology AB (Uppsala, Sweden). Vertical second-dimensional 
gels were run in the home-made equipment described in [15]. The 
IPG strips with the wide-range nonlinear pH gradient were either 
Immobiline DryStrip pH 3-10 NL, 180 mm, or alternatively 160 mm 
long IPG strips with a corresponding pH gradient. In both cases 
the IPG strips were delivered by Pharmacia LKB. Immobiline, 
Pharmalyte, Ampholine, GelBond as well as PAG film and the 
ready-made horizontal SDS gels (ExcelGel XL SDS 12-14) were also 
from Pharmacia LKB. Purified proteins and peptides were from 
Sigma (St. Louis, MO). 

2.2 Sample preparation 

Preparation and labeling of unfractionated keratinocytes 
as well as fibroblasts have been described in [16]. Cells 
were lysed in a solution containing 9.8 M urea, 2% w/v 
NP-40, 100 mM DTT and 2% v/v Ampholine pH 7-9. 

2.3 2-D gel electrophoresis 

First-dimensional focusing was performed according to Görg et al. 
[2] with some minor modifications, as described in [9]. 
Rehydration of the IPG strips was made in a solution containing 
9.8 M urea, 2% w/v CHAPS, 10 mM DTT and 2% v/v carrier ampholyte 
mixture. The carrier ampholyte mixture consisted of 2 parts 
Pharmalyte pH [...], [...] Ampholine pH 6-8 and 1 part Pharmalyte 
pH 8-10.3. Usually, cathodic sample application was used and the 
samples were diluted 2-20 times in a solution containing 9.8 M 
urea, 4% w/v CHAPS, 1% w/v DTT and 35 mM Tris base. For acidic 
application the Tris base was substituted with 100 mM acetic acid. 
The degree of dilution and sample volume (20-100 µL) depended on 
the particular sample and the IPG and on whether visualization of 
the proteins was to be done by Coomassie Brilliant Blue or silver 
staining. With the wide-range nonlinear IPG, 10-30 µg of total 
protein was loaded for silver staining and 100-200 µg for 
Coomassie staining. Focusing was done overnight with Vh products 
in the range of 45-60 kVh with 160 mm long strips and 50-70 kVh 
with 180 mm long strips. Solubilization of polypeptides and 
blocking of -SH groups prior to the second-dimensional run, as 
well as loading on the second-dimensional gel, were done as 
described in [9]. The stacking gel was omitted and 5-10 mm were 
left at the top of the second-dimensional gel for applying the 
IPG strip. The space was filled with electrode buffer containing 
0.5% w/v agarose. Casting, running, staining and autoradiography 
were carried out as described in [15]. 

2.4 Experimental determination of pI values 

The determination of the pK differences between Immobilines 
pK 4.6, pK 6.2 and pK 7.0, necessary for the calibration of the 
pH scale at 25 °C in 9.8 M urea, was done as described in [9] with 
the same narrow-range IPGs. The pH scale was defined by setting 
the pK value of Immobiline pK 4.6 equal to 4.61 [9] and the 
determined pK differences gave the pK values of Immobilines 
pK 6.2 and pK 7.0, equal to 5.73 and 6.54, respectively. The pK 
differences found are in good agreement with values derived from 
[17] and [8] by extrapolation to 9.8 M urea concentration. As in 
[9], additional narrow-range recipes have been used for 
determining pI values. With narrow-range IPGs extending to pH 
values higher than the pK value of Immobiline pK 7.0, anodic 
sample application was used with acetic acid added to the sample 
solution. Otherwise, cathodic sample application was used with 
the same sample buffer as for wide-range IPGs. 
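
The way the pH scale is anchored can be illustrated with a short Python sketch: one Immobiline pK is fixed and the others follow from measured pK differences. The differences used below (1.12 and 1.93) are not quoted in the text; they are back-calculated here from the reported values 5.73 and 6.54, and the function and dictionary names are assumptions made for the example.

```python
ANCHOR_NAME = "Immobiline pK 4.6"
ANCHOR_PK = 4.61  # value fixed by definition, as stated in the text

# pK differences relative to the anchor, back-calculated from the
# reported values (5.73 - 4.61 and 6.54 - 4.61); illustrative only.
PK_DIFFERENCES = {
    "Immobiline pK 6.2": 1.12,
    "Immobiline pK 7.0": 1.93,
}

def calibrate(anchor_name, anchor_pk, differences):
    """Build a pK scale from one fixed anchor and measured differences."""
    scale = {anchor_name: anchor_pk}
    for name, delta in differences.items():
        scale[name] = round(anchor_pk + delta, 2)
    return scale

scale = calibrate(ANCHOR_NAME, ANCHOR_PK, PK_DIFFERENCES)
for name, pk in scale.items():
    print(f"{name}: {pk:.2f}")
```

Because only differences are measured, shifting the anchor value would shift the whole scale by the same amount without changing the spacing between the Immobilines.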

2.5 Protein compositions used for pI calculations 

With the exception of vimentin, protein compositions are from the 
Swiss-Prot database [18]. For vimentin, we used the data from 
[19], where the amino acid at position 4[...] is a D instead of 
an S. Information in the Swiss-Prot database on phosphorylation 
has been disregarded because it was known from earlier studies 
(J. E. Celis, unpublished results) that the spots in question 
corresponded to the unphosphorylated forms of the peptides 

different substituents on the α-carbon were taken into [...] 
the aid of the IPG-maker program [20]. 
2.7 pK values used for pI calculations 

For the carboxy-terminal group and internal glutamyl and aspartyl 
residues the same pK values were used as in [...]. For C-terminal 
glutamyl and aspartyl residues, separate pK values were derived 
with the aid of the Taft [...]. The pK values of histidyl groups 
were [...] the pI values of human carbonic anhydrase I as in 
[...]. For N-terminal glycine a pK value of [...]. The pK shift 
caused by a substituent on the α-carbon was assumed to be 
identical with the pK shift the substituent caused for the amino 
group in the amino acid. 2.28 pH units were subtracted from [...] 
the amino groups in the amino acids [...] in [22, 23]. The 
approximate pK value of 9 for the cysteinyl group was taken from 
[24]. For tyrosyl and arginyl groups we used the pK values for the 
amino acids [22 [...]. [...] lysyl groups the effect of high urea 
concentration on amino groups was taken into account and 0.[...] 
pH units were subtracted from the amino acid pK value. These last 
three pK values are far from the pH range under study and the 
results found would have been the same if lysyl and arginyl groups 
were assumed to be fully ionized while the ionization of tyrosyl 
groups were [...]. A complete list of the pK values used is given 
in Table 1. 

C-terminal 
N-terminal 
    Ala 
    Met 
    Ser 
    Pro 
    Thr 
    Val 
    Glu 
Internal 
    Asp 
    Glu 
    His 
    Cys 
    Arg 
C-terminal side chain groups 
    Asp 
    Glu 

2.6 Calculation of pI values 

For the pI calculations it was assumed that the same pK 
value could be used for an amino acid residue in all 
polypeptides and in all positions in the peptide except 
for N- or C-terminally placed amino acids. For the pK 
values of the N-terminal amino groups the effect of the 


2.8 Statistical analysis 

Statistical comparisons of the experimental and calculated pI 
values were done on an Apple Macintosh IIsi using the statistical 
package Statistica/Mac, release 3.0b (from StatSoft Inc., Tulsa, 
Oklahoma). Calculated and experimental pI values were compared by 
the t-test for correlated samples (paired t-test). The normality 
of pI differences was estimated graphically by probability plots. 
The variances of the data presented here and the similar data on 
plasma and liver proteins in [9] were compared by the F-test. 
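
The paired comparison described above can be sketched with the Python standard library. This is a minimal illustration, not the Statistica procedure used by the authors: the function name is hypothetical, the pI values in the example are invented placeholders rather than data from this study, and only the t statistic, degrees of freedom and variance of the differences are computed (turning t into a p-level needs the t distribution, for example via `scipy.stats.ttest_rel`).

```python
import math
import statistics

def paired_t(calculated, experimental):
    """Paired t statistic for calculated versus experimental values.

    Returns (t, degrees_of_freedom, variance_of_differences).
    """
    if len(calculated) != len(experimental):
        raise ValueError("both lists must have the same length")
    diffs = [c - e for c, e in zip(calculated, experimental)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)          # sample standard deviation
    t = mean_d / (sd_d / math.sqrt(n))
    return t, n - 1, statistics.variance(diffs)

# Placeholder pI values, invented for the example.
calc_pi = [5.10, 6.02, 4.75, 7.31, 5.88]
exp_pi = [5.08, 6.05, 4.77, 7.30, 5.90]
t, df, var = paired_t(calc_pi, exp_pi)
print(f"t = {t:.3f}, df = {df}, variance of differences = {var:.5f}")
```

A t statistic close to zero corresponds to a large p-level, i.e. no detectable systematic offset between calculated and experimental values.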

3 Results and discussion 

3.1 Identification of polypeptides and pI determinations 

The 2-D gel maps of [35S]methionine-labeled proteins from 
noncultured, unfractionated normal human keratinocytes, focused 
with the nonlinear, wide-range IPG [...] pH gradients in the first 
dimension, are shown in Figs. 1 and 2, respectively. The IPG 
extends to higher pH values but otherwise the two patterns are 
very similar and most of the spots in the IPG pattern can be 
directly related to the corresponding spots in the CA-IEF gel. To 
obtain comparable patterns it was important to keep the focusing 
temperature as similar as possible. Compared to other studies 
[1-4, 9, 10, 12-[...]], we increased the urea concentration in 
the focusing gel to 9.8 M because keratins streaked badly in the 
focusing dimension when 8 M urea was used, presumably due to 
aggregates of acidic and basic keratins. An increase in urea 
concentration to 9 M or more eliminated these streaks; apart from 
this effect, no other major changes in the focusing positions 
were observed. In Fig. 1 we have indicated the positions of 41 
known proteins from the human keratinocyte 2-D gel database that 
are most likely common to most human cell types. The choice was 
made because these proteins are easy to identify with certainty. 
With the exception of stratifin (spot [...]), involucrin (spot 4) 
and keratin 14 (spot 15), which are all 
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3.2 Comparison between the determined and calculated 
pI values for human keratinocyte proteins

Thirty-six of the 41 proteins listed in Table 2 are found
in the Swiss-Prot database. Contrary to the plasma and
liver proteins used in [9], the pI calculations on the
proteins used in this study posed some problems that
reflected the way in which they were characterized. The
According to Brown and Robert [25], proteins with
acetylated N-terminals correspond in weight to approximately
80% of the soluble protein in ascites cells. Based on
results from N-terminal sequencing, at least 40% of the
spots in the human liver protein 2-D gel map appear to
be blocked [3]. The corresponding number, derived from
107 spots in the 2-D gel map of human T-lymphocyte
proteins, falls between 60 and 65% (J. Strahler, personal
communication). Information concerning N-terminal
blockage is not normally available, and in the Swiss-Prot
database only 6 of the 36 keratinocyte proteins are
specified as N-terminally blocked. We have, within the present
material, defined 18 proteins for which the N-terminals
are very likely to be correctly described. Six of these
proteins are listed in the Swiss-Prot database as
N-terminally blocked, four represent proteins which appear in
the human liver 2-D gel map and have been N-terminally
sequenced as liver proteins [3] and the remaining
eight have N-terminal groups other than M, S and A, i.e.
N-terminals for which N-acetylation is uncommon [26].
In Figs. 4A, B, C and D pI values calculated from
Swiss-Prot database information are plotted against the
experimentally determined pI values for all the keratinocyte
proteins listed in Table 2 and for the 18 selected
proteins, as well as for the plasma and liver proteins
from [9] valid for 10 °C.

The calculations show that without knowledge of the
status of the N-terminal group, precise predictions of pI
values for eukaryotic proteins cannot be achieved based
on the information available in Swiss-Prot and similar
databases. However, for proteins where the N-terminal
status is known, we find good correlation between
predicted and experimental pI values. When the variance of
the pI discrepancies and the variance of calculated
charges at the experimental pI values derived from the
present data set are compared with the corresponding



There are four plots: (A) the 36 polypeptides from normal human
keratinocytes (no corrections), (B) the 36 polypeptides from Fig. 4A
where pI values have been recalculated for 12 polypeptides with M,
S and A as N-terminal, assumed blocked based on calculated
charge, (C) the 18 selected polypeptides with information on the
N-terminal configuration, and (D) plasma and liver proteins.
(B) the 36 polypeptides from Fig. 4A (including the 16 marker polypeptides, where pI values have been recalculated
assuming N-terminal blockage; x indicates recalculated pI values; nucleolar protein B23 is indicated with an arrow), (C) 18 polypeptides with
information on N-terminal configuration and (D) plasma and liver proteins.
values derived from the data on plasma and liver
proteins in [9] (Table 3), the present data are found to result
in larger variances for the values of both pI discrepancies
and calculated charge at the experimental pI value when
no information on posttranslational modification is
taken into consideration. Correction for possible
N-acetylation of 12 polypeptides with M, S and A as N-terminal
results in a smaller variance of pI discrepancies,
although not significantly different from values derived
from [9], whereas the variance of the calculated charge at
the experimental pI value is significantly higher. For the
18 selected proteins the variance for the pI discrepancies
is significantly smaller than for the data in [9]; however,
the corresponding value for calculated charge at the
experimental pI value does not improve to the same
extent. This, we believe, reflects another difference
between the two sets of proteins used for the
calculations. Based on spot distributions in 2-D gel maps, the
set of proteins used here has a molecular weight
distribution that is more representative of the patterns
observed in mammalian cells. In the study by Bjellqvist
et al. [9] most of the high molecular weight plasma
proteins had to be excluded due to their unknown content
of sialic acid, which made the proteins analyzed in that
study heavily biased towards low molecular weight
proteins. The buffer capacity of proteins normally increases
with the protein's molecular weight, and the average
buffer capacity of the presently selected proteins with
assumed known N-terminals is 18 charge units/pH unit,
while the corresponding value for the proteins used in
[9] is only 9 charge units/pH unit. High buffer capacity
can be expected to improve the agreement between
calculated and experimental pI values. Inspection of the
data presented in Table 2 for the polypeptides with
assumed known N-terminals verifies the importance of
the buffer capacity. For 8 polypeptides having buffer
capacities higher than 15 charge units/pH unit, the
calculations in all cases yielded pI discrepancies with absolute
values of less than 0.02 pH units. The largest
discrepancy, 0.06 pH units, was observed for annexin II and
stathmin, proteins which have low buffer capacity: 0.9
and 6.6 charge units/pH unit, respectively. The
probability that the focusing position of a protein with known
composition will fall within a certain distance from its
calculated pI value therefore cannot be predicted from the
variance alone. The buffer capacity of the specific protein
must be taken into consideration as well. As indicated
by the decrease of the variance of calculated charges at
the experimental pI value for the selected proteins, the
observed improvement cannot solely be due to the
higher buffer capacity of the keratinocyte proteins. The
two studies relate to different experimental conditions.
Good agreement between experimental and calculated
pI values implies that the proteins are defolded, and a
factor that may contribute to the observed improvement
is a more complete defolding of proteins caused by the
higher temperature and urea concentration used in this
study.
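
The quantities discussed above (net charge at a given pH, the pI as the pH of zero net charge, and buffer capacity as the slope of the charge curve) can be illustrated with a short sketch. It assumes a simple Henderson-Hasselbalch model with one average pK per group type. The pK numbers are generic illustrative values chosen for this example, not the values used in the study, and the function names are hypothetical.

```python
# Illustrative pK values (assumption): generic textbook-style numbers,
# NOT the pK set used in the paper for gels containing urea.
PK_ACIDIC = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}  # charge -1 when deprotonated
PK_BASIC = {"H": 6.0, "K": 10.5, "R": 12.5}            # charge +1 when protonated
PK_N_TERM = 8.0
PK_C_TERM = 3.1


def net_charge(sequence, ph, n_blocked=False):
    """Net charge of a fully unfolded polypeptide at the given pH."""
    charge = 0.0
    for residue in sequence.upper():
        if residue in PK_ACIDIC:
            charge -= 1.0 / (1.0 + 10 ** (PK_ACIDIC[residue] - ph))
        elif residue in PK_BASIC:
            charge += 1.0 / (1.0 + 10 ** (ph - PK_BASIC[residue]))
    charge -= 1.0 / (1.0 + 10 ** (PK_C_TERM - ph))      # C-terminal carboxyl
    if not n_blocked:                                    # a blocked N-terminal carries no charge
        charge += 1.0 / (1.0 + 10 ** (ph - PK_N_TERM))
    return charge


def isoelectric_point(sequence, n_blocked=False, low=0.0, high=14.0, tol=1e-4):
    """pH of zero net charge, found by bisection.

    Net charge falls monotonically with pH, so bisection converges.
    Assumes the sequence has at least one positively charged group.
    """
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid, n_blocked) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


def buffer_capacity(sequence, ph, n_blocked=False, step=0.01):
    """Buffer capacity -dQ/dpH in charge units per pH unit (central difference)."""
    upper = net_charge(sequence, ph - step, n_blocked)
    lower = net_charge(sequence, ph + step, n_blocked)
    return (upper - lower) / (2.0 * step)
```

Calling `isoelectric_point(seq)` and `isoelectric_point(seq, n_blocked=True)` on the same sequence shows how strongly the assumed N-terminal status moves the prediction, and `buffer_capacity(seq, pI)` indicates how a charge error translates into a pI error.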

The data indicated that the precision with which pI
values can be predicted for polypeptides with high buffer
capacity is better than the precision with which
experimental pI values can be determined. If the pH is defined
through the pK values of the immobilized groups in the
IPG-containing gel, the precision of the experimental
calculated data will depend on the pH difference
between the pI and the pK value of the immobilized
group with the closest pK. For the present study this will
give pI determinations with a precision varying in the
range of ± 0.02-0.05 pH units [9]. The good agreement
observed between the calculated and experimental pI
values is due to the fact that errors are mainly
systematic and, as discussed in [9], they will largely be cancelled
out in the calculations. A pH scale defined through the
presently determined pI values will not necessarily
reflect the variation of the hydrogen ion activity during
the focusing step in an optimal way, but it still allows
precise predictions of focusing positions for polypeptides
with known compositions, including information on
posttranslational modifications. Calculated net charge at
the experimentally found isoelectric point defined in this
scale will serve as a tool to verify that the polypeptide
composition used in the calculation is correct and
complete. Exceptions to this are proteins such as involucrin
and heat shock protein 90 that have very high buffer
capacities. Introduction of an extra charge unit into
these proteins will only result in pI shifts falling in the
range of 0.01-0.02 pH units and the effect is that the
quality of the pH definition - the precision by which pK
values used in the calculations are given and the
precision of experimental pI values in these cases - will limit
the possibilities to verify polypeptide composition based
on the experimental pI value.
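
The relation between an extra charge unit and the resulting pI shift can be made explicit with a linear approximation: near the pI the net charge is roughly Q(pH) = -β (pH - pI), where β is the buffer capacity, so a charge change Δq moves the pI by about Δq/β. The sketch below assumes this linearization holds over the small shifts involved; the function names are hypothetical.

```python
def pi_shift(delta_charge, buffer_capacity):
    """Approximate pI shift (pH units) caused by a change in net charge.

    Uses the linearization Q(pH) = -beta * (pH - pI) around the pI,
    with beta given in charge units per pH unit.
    """
    if buffer_capacity <= 0:
        raise ValueError("buffer capacity must be positive")
    return delta_charge / buffer_capacity


def charge_discrepancy(delta_pi, buffer_capacity):
    """Approximate charge difference corresponding to a pI discrepancy."""
    if buffer_capacity <= 0:
        raise ValueError("buffer capacity must be positive")
    return delta_pi * buffer_capacity
```

Under this approximation, a shift of 0.01-0.02 pH units per charge unit corresponds to a buffer capacity of roughly 50-100 charge units/pH unit, while for stathmin (6.6 charge units/pH unit) the reported 0.06 pH discrepancy amounts to about 0.4 charge units.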

Statistical comparison of experimental and calculated pI
values was done using the t-test for dependent samples
and normality of the discrepancies was estimated by
probability plots. For the 36 proteins, the p-level is
0.0021, indicating that a result like this is unlikely to
be a chance effect and must be assumed to represent a
real difference. After correction for the most likely
N-terminal configuration, the p-level is 0.043 and cannot
be accepted as representing the same population since
the p-level is less than 0.05 - the traditional p-limit of
statistical significance. For the 18 proteins with a known
or very likely N-terminal configuration the t-test gave a
p-level of 0.49, which verifies that the experimental and
calculated pI values are not significantly different.

Besides showing that pI values for denatured proteins
with known compositions can be calculated with a high
degree of precision from average pK values, the results
also provide strong support for the notion that
N-terminal blockage heavily depends on the nature of
the N-terminal groups [26]. The results seem to indicate
that with N-terminals other than M, S and A, only a few
proteins have blocked N-terminals (1 out of 10 proteins
in the present study), while it can be inferred from the
data presented in Table 2 that a majority of the proteins
with M, S and A as N-terminal are blocked. After
correction for the effect of suspected N-terminal blockage
there is only one protein (nucleolar protein B23) out of
the 36 used in this study, which, in spite of a high buffer
capacity, has a marked difference of 0.11 pH units
between predicted and determined pI values (Fig. 4B);
this corresponds to 3 charge units due to the high buffer
capacity of this protein. This discrepancy in pI prediction
and calculation of net charge at the pI is probably not
due to deficiencies in the database information but
instead reflects a shortcoming of the model used for pI
calculations. Nucleolar protein B23 contains a domain
extremely rich in aspartic and glutamic acid residues
(Table 4), in which 26 out of 28 amino acid residues
from position 161 to 188 are either a D or an E. A
calculation based on the use of average pK values
uninfluenced by the charged neighboring amino acid
residues cannot be expected to correctly describe the pI
value with almost half of the acidic groups packed



Table 4. Amino acid sequence of nucleolar phosphoprotein B23 




together into a highly negatively charged region. The
limitation caused by calculations based on average pK
values does not severely limit the usefulness of the
approach since a search through Swiss-Prot shows that
this type of D/E-rich motif is uncommon, and the
existence of a highly charged region is immediately apparent
upon inspection of the amino acid sequence.
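
Inspection for such a region can also be automated. The following is a minimal sketch of a sliding-window scan that flags stretches unusually rich in D and E. The function name and the default thresholds are assumptions made for this example and are not taken from the paper; the defaults merely mirror the 28-residue stretch described for nucleolar protein B23.

```python
def acidic_windows(sequence, window=28, min_acidic=20):
    """Find windows rich in aspartic (D) and glutamic (E) acid.

    Returns a list of (start, end, count) tuples, where start and end
    are 1-based inclusive positions and count is the number of D/E
    residues in that window. Thresholds are illustrative assumptions.
    """
    seq = sequence.upper()
    if window <= 0 or len(seq) < window:
        return []
    hits = []
    # Count D/E in the first window, then slide one residue at a time.
    count = sum(1 for residue in seq[:window] if residue in "DE")
    for start in range(len(seq) - window + 1):
        if start > 0:
            count -= seq[start - 1] in "DE"            # residue leaving the window
            count += seq[start + window - 1] in "DE"   # residue entering the window
        if count >= min_acidic:
            hits.append((start + 1, start + window, count))
    return hits
```

A sequence that returns any hits is a candidate for which a calculation based on average pK values may be unreliable.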

The quality of the information available in databases,
especially concerning posttranslational modifications, is
a major problem when the data is to be used for pI
predictions. The p-level of 0.043 found for all 36 proteins
after correction for N-acetylation shows that this
problem is not only limited to N-terminal blockage and the
very good agreement found for the eighteen polypeptides
with assumingly correctly described N-terminal
(Fig. 4C) must be regarded as an exception from this
point of view. N-terminal blockage is generally the main
problem in relation to pI predictions for eukaryotic
proteins. Of the 36 keratinocyte proteins analyzed, 18-20
are suspected to be N-terminally blocked (6 proteins
blocked according to Swiss-Prot, 12 proteins with M, S or A
as N-terminal and assumingly blocked based on the
calculated charge, and two proteins, involucrin and
nucleolar protein B23, with M as N-terminal for which
the data does not allow any conclusion). This is in
reasonable agreement with the conclusions based on the
N-terminal sequencing data derived in connection with
2-D gel electrophoresis. N-terminal blockage can be
suspected for 17-19 of the 26 proteins with M, S or A as
N-terminal, while only 1 in 10 proteins with other
N-terminal groups are blocked. The information that the
frequency of N-terminal blockage is strongly related to
the nature of the N-terminal group will be of some help
in connection with pI predictions based on database
information. However, without information from other
sources, an uncertainty will always remain as to whether
the N-terminal charge should be included in the pI
calculation.
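
The rule of thumb in this paragraph can be written down as a small decision helper. This is only a sketch: the fractions are the counts quoted above for this one data set (17-19 of 26 versus 1 in 10), the 0.5 cut-off and the function names are assumptions made for illustration, and an experimentally established N-terminal status should always take precedence.

```python
# Residues for which the text reports frequent N-terminal blockage.
BLOCK_PRONE_N_TERMINALS = frozenset("MSA")


def blockage_fraction_range(sequence):
    """Return (low, high) fraction of blocked proteins for this N-terminal.

    Fractions come from the counts quoted in the text and describe
    that data set only; they are not general probabilities.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    if sequence[0].upper() in BLOCK_PRONE_N_TERMINALS:
        return (17 / 26, 19 / 26)
    return (1 / 10, 1 / 10)


def include_n_terminal_charge(sequence, known_blocked=None):
    """Decide whether to count the N-terminal amino group in a pI calculation.

    known_blocked: True/False if the status is known from another
    source, or None to fall back on the residue-based rule of thumb.
    """
    if known_blocked is not None:
        return not known_blocked
    low, _high = blockage_fraction_range(sequence)
    return low < 0.5  # assumed cut-off: treat as blocked if blockage is more likely than not
```

For example, `include_n_terminal_charge("MKT")` returns `False`, while a sequence starting with P returns `True`; as the text stresses, the residue-based guess still leaves real uncertainty for any individual protein.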



4 Concluding remarks 

The data presented here lay the foundation for
comparing 2-D gel protein maps of different cell types
generated with nonlinear, wide-range IPGs in the first
dimension. The focusing positions of 41 polypeptides common
to most human cell types have been described in a pH
scale that allows focusing positions to be predicted with
a high degree of accuracy, provided that the composition
of the polypeptides is known and that information on
posttranslational modifications is available. For
polypeptides with a very high buffer capacity, the limiting
factor is the precision with which experimental pH
values can be determined rather than the precision of
the calculations. Possible deficiencies in the pH scale
description of the variation of the hydrogen ion activity
have, at least at the present state, no consequences for its
practical use. The major limitation in connection with
predictions of focusing positions from polypeptide
compositions is the quality of existing data on protein
compositions, especially concerning posttranslational
modifications. Amino acid sequences have been reasonably
easy to obtain, while posttranslational modifications
have been difficult and work-intensive to determine.
Recent developments in the field of mass spectrometry
are fast changing this situation and within the next years
we can expect a surge in reliable data in this area. While
awaiting this development, verification of correctness
and completeness of available information on
polypeptide composition can be provided by experimental pI
values in a pH scale based on the pI values determined
in this study. So far, our data cover the pH range below
pH 7.5. The basic pH range covered by NEPHGE as
first dimension will be covered in forthcoming work.

Received December 29, 1993
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Large Scale Biology Corporation is the leader in the integrated discovery, production 
and application of proteins - the functional units of all biological processes. 

Large Scale Biology Corporation (LSB, Vacaville, CA) and its subsidiary Large Scale 
Proteomics Corp. (LSP, Germantown, MD) are a biotechnology enterprise with the mission of 
accelerating the speed and productivity of the life sciences industry product discovery and 
development programs. Unique among biotechnology companies is LSB's integration of 
technologies to discover, analyze, manufacture and find new applications for proteins - the 
functional units of all biological processes. 

Genomics companies have focused on deciphering genetic information, providing an initial but 
only partial understanding of biological processes. LSB's proprietary protein technologies can 
enable the transformation of genomic information into products such as drug targets, 
therapeutics, diagnostics for drug efficacy and toxicity, and traits for agricultural crops. Large 
Scale Biology has gone beyond the "genomics" realm in its business model and developed 
ways to integrate the discovery of gene function with quantitative protein analysis and protein 
manufacturing. This integration of technology platforms favorably positions LSB as a leading 
provider of valuable content to industry leaders in the fields of diagnostics, therapeutics, 
vaccines and agribusiness. 

LSB was founded in 1987 with the goal of commercializing its proprietary GENEWARE viral 
vector system - a novel technology for gene expression. Using safe RNA viruses to transiently 
express genes in non-recombinant plants, LSB has positioned itself in the industry to provide 
cost-effective manufacturing and purification of diverse protein and peptide products. The 
same technology can be applied to the expression of libraries of foreign genes in an 
automated, high-throughput format to discover the function of genes with unparalleled 
efficiency. The GENEWARE system and associated proprietary technologies form the basis 
for LSB's functional genomics, biomanufacturing and a variety of proprietary products under 
development. 

From its foundation, LSB understood the need to integrate functional genomic and protein 
manufacturing expertise with quantitative protein analysis and informatics to become a 
world-leader in the protein field. In 1999, LSB acquired a privately held pharmaceutical 
proteomics company originally founded in 1985. Large Scale Proteomics Corporation (a wholly 



05/04/2001 8:20 AM



Biosource Technologies 



hup://ww,w.lsbc.com/ini'o/inl'o.hir 



owned subsidiary of Large Scale Biology Corporation) is an industry leader in identifying and 
characterizing proteins in all types of biological samples for the discovery and development of 
new and more effective therapies, diagnostics, and agricultural products. 

"Proteomics" is the study of the entire complement of proteins expressed in a cell, tissue, or 
organism. Proteomics can significantly improve drug discovery and development because 
most illness is associated with imbalances among, or malfunctions of, proteins. Only a small 
fraction of diseases can be attributed to the presence of a defective gene. Unlike classical 
genomics approaches that discover genes that may relate to a disease, LSP has developed a 
proprietary system called the ProGEx module for directly characterizing proteins associated 
with disease. Using this same technology, LSP can characterize the effects of candidate drugs 
intended to reverse a disease process, and to determine the degree to which this objective is 
achieved free of adverse side effects. 

LSB and LSP have protected their many discoveries through an extensive portfolio of domestic
and foreign patents and have developed commercial alliances and partnerships to exploit the 
value of their technologies. LSB and LSP scientists and engineers focus on the development 
and application of resources to help clients meet their objectives as well as the development of 
our own proprietary products for subsequent partnering with industry leaders. 

A combined staff of 140 professionals operates from three locations in the United States with 
a network of collaborators and affiliates throughout the US and Europe. Company 
headquarters, R&D laboratories and its Genomics division are located in Vacaville, California 
about 60 miles northeast of San Francisco. Process development and biomanufacturing take 
place in Owensboro, Kentucky, and LSB's Large Scale Proteomics Corporation subsidiary is 
located in Germantown, Maryland. 

In August, 2000, LSB completed an initial public offering (IPO) of 5 million shares of common 
stock and now trades on the NASDAQ under the symbol LSBC. 

Leadership - Large Scale Biology Corporation 

Robert L. Erwin, Chairman of the Board and Chief Executive Officer, founded LSB™ and has
served as a director and officer since 1987. Mr. Erwin is the former chairman of the State of 
California Breast Cancer Research Council and currently serves on the University of California 
President's Engineering Advisory Council. He is Chairman of the Supervisory Board of Icon 
Genetics AG. As a co-founder of Sungene Technologies Corp., Mr. Erwin served as Vice 
President of Research and Product Development from 1981 through 1986. He has served on 
the Biotechnology Industry Advisory Board for Iowa State University. Mr. Erwin received his 
M.S. degree in Genetics from Louisiana State University and is an inventor on several LSB 
patents. 

David R. McGee, Ph.D., a co-founder of LSB and Senior Vice President and Chief Operating
Officer, has been an officer since 1987. Prior to joining LSB, Dr. McGee was Vice President of 
Operations at Sungene Technologies Corporation from 1983 to 1987. Dr. McGee received his 
Ph.D. in Genetics from Louisiana State University and served as a faculty instructor of zoology 
and genetics at Louisiana State University. 

Laurence K. Grill, Ph.D., a co-founder of LSB and Senior Vice President, Research and
Development, has served as an officer since 1987. Dr. Grill was the Manager of Plant 
Molecular Biology for Sandoz Crop Protection Corp. from 1984 to 1987 and Senior Research 
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Scientist in the Department of Molecular Biology at Zoecon Research Institute from 1980 to 
1984. He received his Ph.D. from the University of California at Riverside with an emphasis on 
the molecular basis for viral gene expression in plants. 

R. Barry Holtz, Ph.D., Senior Vice President, Biopharmaceutical Manufacturing, has served
the company as an officer since 1989 upon the acquisition of Holtz Bio-Engineering, which
was founded in 1980. Dr. Holtz was a co-founder and Director of Research for MFI, Inc., the
largest manufacturer of microencapsulated nutrients for agriculture and Director of 
Fundamental Research at Foremost-McKesson, Inc. Dr. Holtz received his Ph.D. in 
Biochemistry from Pennsylvania State University and served as Assistant Professor in the 
Department of Food Science and Nutrition at Ohio State University. 

Daniel Tuse, Ph.D., has been an officer of LSB since he joined the Company in 1995 as Vice 
President, Pharmaceutical Development. Dr. Tuse manages the company's pharmaceutical 
design and development programs, including LSB's novel vaccines and immunotherapeutics 
initiatives. Prior to joining LSB, Dr. Tuse was Assistant Director of SRI International's (Menlo 
Park, Calif.) Life Sciences Division. In his 17 years at SRI, Dr. Tuse developed extensive R&D 
experience in pharmaceuticals and specialty chemicals, serving an international list of clients. 
Dr. Tuse received his Ph.D. in Microbiology (1980, cum laude) with a minor in Toxicology from 
the University of California, Davis. 

John S. Rakitan, a co-founder of LSB, Senior Vice President & General Counsel and 
Secretary, has served as an officer since 1988. Prior to joining LSB, Mr. Rakitan was an 
attorney in private practice. Mr. Rakitan received his J.D. degree from the University of Notre 
Dame. 

Michael D. Centron, Treasurer, has served as Controller since 1988 and was elected as 
Treasurer in 1991. Mr. Centron was Audit Supervisor for Varian Associates from June 1985
through July 1988, and he also worked for Arthur Young and Co. (currently Ernst & Young). 
Mr. Centron is a certified public accountant and received his M.B.A. degree from the University 
of California at Berkeley. 

Guy della-Cioppa, Ph.D., is an officer of the company and currently serves as Vice President, 
Genomics. Prior to joining the company in 1989, Dr. della-Cioppa worked for Monsanto 
Company in St. Louis, MO from 1984-1989 and was an NIH Postdoctoral Fellow at the 
Worcester Foundation for Experimental Biology in Shrewsbury, MA from 1983-1984. He 
received his Ph.D. in Biology from the University of California, Los Angeles. 

William M. Pfann joined Large Scale Biology in August 2000 as Senior Vice President Finance 
and Chief Financial Officer. Mr. Pfann was formerly with PricewaterhouseCoopers LLP from 
1969 to July 2000, most recently as the Risk Management Partner for the Western Region. He 
served in a number of management roles at PwC, including leader of the firm's Silicon Valley 
audit practice, National Director of the networking and communications sector and Managing 
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University of California, Berkeley, in Business Administration and an MBA in Accounting from 
Golden Gate University. 
N. Leigh Anderson, Ph.D., Chairman, President and CEO of Large Scale Proteomics
Corporation (LSP™). Dr. Anderson obtained his B.A. in Physics with honors from Yale and a 
Ph.D. in Molecular Biology from Cambridge University (England) working with M. F. Perutz as 
a Churchill Fellow at the MRC Laboratory of Molecular Biology. Subsequently he co-founded 
the Molecular Anatomy Program at the Argonne National Laboratory (Chicago) where his 
work in the development of 2-dimensional electrophoresis (2-DE) and molecular database 
technology earned him, among other distinctions, the American Association for Clinical 
Chemistry's Young Investigator Award for 1982 and the 1983 Pittsburgh Analytical Chemistry 
Award. In 1985 Dr. Anderson co-founded LSP (originally Large Scale Biology Corp., 
Germantown, MD) in order to pursue commercial development and large-scale applications 
of 2-D electrophoretic protein mapping technology. 

Norman G. Anderson, Ph.D., Chief Scientist at LSP. Dr. Anderson has a distinguished record
as an inventor. His career includes senior positions at Oak Ridge and Argonne National 
Laboratories (ORNL and ANL), more than 300 scientific publications, and the receipt of more 
than 20 prestigious awards in recognition of his work in science and technology. For his 
invention of the zonal ultracentrifuge, he received the John Scott Medal Award, and for the 
centrifugal fast analyzer, the Preis Biochemische Analytik fur Klinische Chemie from Die 
Deutsche Gesellschaft fur Klinische Chemie for the most outstanding analytical development 
in clinical chemistry worldwide during a 2-year period. In 1984 ANL awarded him its career 
patent leader award for the largest number of patents issued to an employee. At that time the 
commercial value of his inventions in terms of U.S. sales and royalties from foreign licensing 
were $250 million and $1 million, respectively. Dr. Anderson received his degrees at Duke 
University: a B.A. in Zoology, M.A. in Physiology, and Ph.D. in Cell Physiology. He holds 28 
patents. 

Constance Seniff, Vice President, Operations. Ms. Seniff has managed LSP's operations
since 1993. Her background includes thirteen years in international business prior to joining 
LSP, five abroad in the employ of foreign firms. Ms. Seniff is responsible for helping 
formulate and implement business development and database commercialization strategies 
for LSP in coordination with the management of LSP's parent company, Large Scale Biology 
Corporation. Ms. Seniff has a B.Sc. degree in Business (with honors) from Florida State 
University. 

Robert J. Walden, Vice President, Finance at LSP. Mr. Walden joined LSP in 1997 and has 
served as a director since 1999. He previously served as Vice President of Finance and
Administration at Osiris Therapeutics, Inc., and as Chief Financial Officer at the American 
Type Culture Collection (ATCC). Mr. Walden received his degree in Finance from the 
University of Maryland. 

Jean-Paul Hofmann, Ph.D., Vice President, Software Development at LSP. Dr. Hofmann is a
plant geneticist by training, having earned a B.S. in Biology, M.S. in Biochemistry and 
Genetics, and Ph.D. in Plant Genetics from the University of Orsay, Paris. He has extensive 
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experience in using 2-DE in agronomic research and in designing analytical software for 1- 
and 2-D applications. He has held senior scientific positions in industry and research 
institutes, in the U.S., France and the Ivory Coast. 

John Taylor, Ph.D., Vice President, Software Development and Bioinformatics. Dr. Taylor is 
the principal developer of Kepler™, LSP's analytical software for automated 2-DE pattern 
analysis. Prior to joining LSB, Dr. Taylor served as computer scientist in the Molecular 
Anatomy Program at Argonne, and on the research staffs of the University of Chicago and 
the Armed Forces Institute of Pathology in Washington, D.C. Dr. Taylor received a B.S. in 
Physics from the University of South Carolina, and a Ph.D. in Nuclear Physics from Duke 
University. 

Sandra Steiner, Ph.D., currently serves as Vice President Proteomics Applications. Prior to 
joining the Company, Dr. Steiner founded and directed the Molecular Toxicology Group at 
Novartis in Basel, Switzerland and was a member in several multi-disciplinary drug 
development project teams. Dr. Steiner received her Ph.D. in Toxicology/Pharmacology from 
the University of Basel, Switzerland. 
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The availability of genome-scale DNA sequence information and reagents has radically altered life-science 
research. This revolution has led to the development of a new scientific subdiscipline derived from a combina- 
tion of the fields of toxicology and genomics. This subdiscipline, termed toxicogenomics, is concerned with the 
identification of potential human and environmental toxicants, and their putative mechanisms of action, through 
the use of genomics resources. One such resource is DNA microarrays or "chips," which allow the monitoring of 
the expression levels of thousands of genes simultaneously. Here we propose a general method by which gene 
expression, as measured by cDNA microarrays, can be used as a highly sensitive and informative marker for 
toxicity. Our purpose is to acquaint the reader with the development and current state of microarray
technology and to present our view of the usefulness of microarrays to the field of toxicology.
Mol. Carcinog. 24:153-159, 1999. © 1999 Wiley-Liss, Inc.
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INTRODUCTION 

Technological advancements combined with in- 
tensive DNA sequencing efforts have generated an 
enormous database of sequence information over the 
past decade. To date, more than 3 million sequences, 
totaling over 2.2 billion bases [1], are contained 
within the GenBank database, which includes the 
complete sequences of 19 different organisms [2]. The 
first complete sequence of a free-living organism, 
Haemophilus influenzae, was reported in 1995 [3] and 
was followed shortly thereafter by the first complete 
sequence of a eukaryote, Saccharomyces cerevisiae [4].
The development of dramatically improved sequenc- 
ing methodologies promises that complete elucida- 
tion of the Homo sapiens DNA sequence is not far 
behind [5]. 

To exploit more fully the wealth of new sequence 
information, it was necessary to develop novel meth- 
ods for the high-throughput or parallel monitoring 
of gene expression. Established methods such as 
northern blotting, RNase protection assays, S1 nuclease
analysis, plaque hybridization, and slot blots 
do not provide sufficient throughput to effectively 
utilize the new genomics resources. Newer methods 
such as differential display [6], high-density filter 
hybridization [7,8], serial analysis of gene expression 
[9], and cDNA- and oligonucleotide-based microarray 
"chip" hybridization [10-12] are possible solutions 
to this bottleneck. It is our belief that the microarray 
approach, which allows the monitoring of expres- 
sion levels of thousands of genes simultaneously, is 
a tool of unprecedented power for use in toxicology 
studies. 



Almost without exception, gene expression is al- 
tered during toxicity, as either a direct or indirect 
result of toxicant exposure. The challenge facing 
toxicologists is to define, under a given set of ex- 
perimental conditions, the characteristic and spe- 
cific pattern of gene expression elicited by a given 
toxicant. Microarray technology offers an ideal plat- 
form for this type of analysis and could be the foun- 
dation for a fundamentally new approach to 
toxicology testing. 

MICROARRAY DEVELOPMENT AND APPLICATIONS 

cDNA Microarrays 

In the past several years, numerous systems were 
developed for the construction of large-scale DNA 
arrays. All of these platforms are based on cDNAs 
or oligonucleotides immobilized to a solid sup- 
port. In the cDNA approach, cDNA (or genomic) 
clones of interest are arrayed in a multi-well for- 
mat and amplified by polymerase chain reaction. 
The products of this amplification, which are usu- 
ally 500- to 2000-bp clones from the 3' regions of 
the genes of interest, are then spotted onto solid 
support by using high-speed robotics. By using 
this method, microarrays of up to 10 000 clones 
can be generated by spotting onto a glass substrate 
[13,14]. Sample detection for microarrays on glass 
involves the use of probes labeled with fluores- 
cent or radioactive nucleotides. 

Fluorescent cDNA probes are generated from con- 
trol and test RNA samples in single-round reverse-tran- 
scription reactions in the presence of fluorescently 
tagged dUTP (e.g., Cy3-dUTP and Cy5-dUTP), which 
produces control and test products labeled with different
fluors. The cDNAs generated from these two 
populations, collectively termed the "probe," are then 
mixed and hybridized to the array under a glass cov- 
erslip [10,11,15]. The fluorescent signal is detected 
by using a custom-designed scanning confocal mi- 
croscope equipped with a motorized stage and lasers 
for fluor excitation [10,11,15]. The data are analyzed 
with custom digital image analysis software that de- 
termines for each DNA feature the ratio of fluor 1 to 
fluor 2, corrected for local background [16,17]. The 
strength of this approach lies in the ability to label 
RNAs from control and treated samples with differ- 
ent fluorescent nucleotides, allowing for the simul- 
taneous hybridization and detection of both 
populations on one microarray. This method elimi- 
nates the need to control for hybridization between 
arrays. The research groups of Drs. Patrick Brown and 
Ron Davis at Stanford University spearheaded the 
effort to develop this approach, which has been suc- 
cessfully applied to studies of Arabidopsis thaliana 
RNA [10], yeast genomic DNA [15], tumorigenic ver- 
sus non-tumorigenic human tumor cell lines [11], 
human T-cells [18], yeast RNA [19], and human inflammatory
disease-related genes [20]. The most dramatic
result of this effort was the first published 
account of gene expression of an entire genome, that 
of the yeast Saccharomyces cerevisiae [21]. 
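
The per-feature ratio described above (fluor 1 to fluor 2, corrected for local background) can be illustrated with a minimal sketch. The spot names and intensities are made up, and the `floor` parameter is an assumption added here so that weak spots do not give zero or negative corrected values; the image-analysis software cited in the text may handle this differently.

```python
import math


def log2_ratio(test_fg, test_bg, ctrl_fg, ctrl_bg, floor=1.0):
    """Return log2(test/control) for one spot after local background subtraction.

    `floor` is a hypothetical lower bound on the corrected intensity.
    """
    test = max(test_fg - test_bg, floor)
    ctrl = max(ctrl_fg - ctrl_bg, floor)
    return math.log2(test / ctrl)


# Illustrative values only: (test foreground, test background,
#                            control foreground, control background)
spots = {
    "spot_A": (4200.0, 200.0, 1200.0, 200.0),  # induced in the test sample
    "spot_B": (900.0, 100.0, 900.0, 100.0),    # unchanged
    "spot_C": (350.0, 100.0, 1100.0, 100.0),   # repressed in the test sample
}

ratios = {name: log2_ratio(*values) for name, values in spots.items()}
for name, value in ratios.items():
    print(f"{name}: log2(test/control) = {value:+.2f}")
```

Working on a log2 scale is a design choice of this sketch: a fourfold induction and a fourfold repression then appear as +2 and -2, symmetric about zero.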

In an alternative approach, large numbers of cDNA 
clones can be spotted onto a membrane support, al- 
beit at a lower density [7,22]. This method is useful 
for expression profiling and large-scale screening and 
mapping of genomic or cDNA clones [7,22-24]. In 
expression profiling on filter membranes, two dif- 
ferent membranes are used simultaneously for con- 
trol and test RNA hybridizations, or a single 
membrane is stripped and reprobed. The signal is 
detected by using radioactive nucleotides and visu- 
alized by phosphorimager analysis or autoradiogra- 
phy. Numerous companies now sell such cDNA 
membranes and software to analyze the image data 
[25-27]. 

Oligonucleotide Microarrays 

Oligonucleotide microarrays are constructed either 
by spotting prefabricated oligos on a glass support 
[13] or by the more elegant method of direct in situ 
oligo synthesis on the glass surface by photolithog- 
raphy [28-30]. The strength of this approach lies in 
its ability to discriminate DNA molecules based on 
single base-pair difference. This allows the applica- 
tion of this method to the fields of medical diagnos- 



tics, pharmacogenetics, and sequencing by hybrid- 
ization as well as gene-expression analysis. 

Fabrication of oligonucleotide chips by photoli- 
thography is theoretically simple but technically 
complex [29,30]. The light from a high-intensity 
mercury lamp is directed through a photolitho- 
graphic mask onto the silica surface, resulting in 
deprotection of the terminal nucleotides in the illu- 
minated regions. The entire chip is then reacted with 
the desired free nucleotide, resulting in selected chain 
elongation. This process requires only 4n cycles 
(where n = oligonucleotide length in bases) to syn- 
thesize a vast number of unique oligos, the total num- 
ber of which is limited only by the complexity of the 
photolithographic mask and the chip size [29,31,32]. 
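
The 4n figure can be checked with a toy simulation. The sketch below is an abstraction, not a model of the actual chemistry: it assumes one masked coupling step per base per position and ignores deprotection efficiency and synthesis direction. The function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def synthesis_cycles(oligo_length):
    """Upper bound on coupling cycles: one per base at each position."""
    return len(BASES) * oligo_length


def synthesize(targets):
    """Grow every target oligo in parallel and count the cycles used.

    In each cycle the 'mask' exposes only those features whose target
    sequence has the current base at the current position.
    """
    length = len(targets[0])
    grown = ["" for _ in targets]
    cycles = 0
    for position in range(length):
        for base in BASES:
            cycles += 1
            for i, target in enumerate(targets):
                if target[position] == base:
                    grown[i] += base
    return grown, cycles


# All 4**3 = 64 possible 3-mers, one per feature.
targets = ["".join(p) for p in product(BASES, repeat=3)]
grown, cycles = synthesize(targets)
print(f"{len(targets)} distinct oligos built in {cycles} cycles")
```

The point of the sketch is that the cycle count grows linearly with oligo length while the number of distinct sequences that can be built grows as 4 to the power n.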

Sample preparation involves the generation of 
double-stranded cDNA from cellular poly(A)+ RNA 
followed by antisense RNA synthesis in an in vitro 
transcription reaction with biotinylated or fluor- 
tagged nucleotides. The RNA probe is then frag- 
mented to facilitate hybridization. If the indirect 
visualization method is used, the chips are incubated 
with fluor-linked streptavidin (e.g., phycoerythrin) 
after hybridization [12,33]. The signal is detected with 
a custom confocal scanner [34]. This method has 
been applied successfully to the mapping of genomic 
library clones [35], to de novo sequencing by hybrid- 
ization [28,36], and to evolutionary sequence com- 
parison of the BRCA1 gene [37]. In addition, 
mutations in the cystic fibrosis [38] and BRCA1 [39] 
gene products and polymorphisms in the human im- 
munodeficiency virus- 1 clade B protease gene [40] 
have been detected by this method. Oligonucleotide 
chips are also useful for expression monitoring [33] 
as has been demonstrated by the simultaneous evalu- 
ation of gene-expression patterns in nearly all open 
reading frames of the yeast strain S. cerevisiae [12]. 
More recently, oligonucleotide chips have been used 
to help identify single nucleotide polymorphisms in 
the human [41] and yeast [42] genomes. 

THE USE OF MICROARRAYS IN TOXICOLOGY 

Screening for Mechanism of Action 

The field of toxicology uses numerous in vivo 
model systems, including the rat, mouse, and rab- 
bit, to assess potential toxicity and these bioassays 
are the mainstay of toxicology testing. However, in 
the past several decades, a plethora of in vitro tech- 
niques have been developed to measure toxicity, 
many of which measure toxicant-induced DNA dam- 
age. Examples of these assays include the Ames test, 
the Syrian hamster embryo cell transformation as- 
say, micronucleus assays, measurements of sister 
chromatid exchange and unscheduled DNA synthe- 
sis, and many others. Fundamental to all of these 
methods is the fact that toxicity is often preceded 
by, and results in, alterations in gene expression. In 
many cases, these changes in gene expression are a 
far more sensitive, characteristic, and measurable 
endpoint than the toxicity itself. We therefore pro- 
pose that a method based on measurements of the 
genome-wide gene expression pattern of an organ- 
ism after toxicant exposure is fundamentally infor- 
mative and complements the established methods 
described above. 

We are developing a method by which toxicants 
can be identified and their putative mechanisms of 
action determined by using toxicant-induced gene ex- 
pression profiles. In this method, in one or more de- 
fined model systems, dose and time-course parameters 
are established for a series of toxicants within a given 
prototypic class (e.g., polycyclic aromatic hydrocar- 
bons (PAHs)). Cells are then treated with these agents 
at a fixed toxicity level (as measured by cell survival), 
RNA is harvested, and toxicant-induced gene expres- 
sion changes are assessed by hybridization to a cDNA 
microarray chip (Figure 1). We have developed a custom
DNA chip, called ToxChip v1.0, specifically for 
this purpose and will discuss it in more detail below. 
The changes in gene expression induced by the test 
agents in the model systems are analyzed, and the 
common set of changes unique to that class of toxi- 
cants, termed a toxicant signature, is determined. 

This signature is derived by ranking across all experiments
the gene-expression data based on relative
fold induction or suppression of genes in treated 
samples versus untreated controls and selecting the 
most consistently different signals across the sample 
set. A different signature may be established for each 
prototypic toxicant class. Once the signatures are de- 
termined, gene-expression profiles induced by un- 
known agents in these same model systems can then 
be compared with the established signatures. A match 
assigns a putative mechanism of action to the test 
compound. Figure 2 illustrates this signature method 
for different types of oxidant stressors, PAHs, and 
peroxisome proliferators. In this example, the un- 
known compound in question had a gene-expres- 
sion profile similar to that of the oxidant stressors in 
the database. We anticipate that this general method 
will also reveal cross talk between different pathways 
induced by a single agent (e.g., reveal that a com- 
pound has both PAH-like and oxidant-like proper- 
ties). In the future, it may be necessary to distinguish 
very subtle differences between compounds within 
a very large sample set (e.g., thousands of highly simi- 
lar structural isomers in a combinatorial chemistry 
library or peptide library). To generate these highly 
refined signatures, standard statistical clustering tech- 
niques or principal-component analysis can be used. 
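
The signature idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' analysis pipeline: expression changes are given as log2 fold changes, a gene enters a class signature only if it changes in the same direction by at least a hypothetical `threshold` in every profile of that class, and an unknown is scored by the fraction of signature genes it reproduces. Gene names and values are invented, loosely following the group A/B/C labels of Figure 2.

```python
def derive_signature(profiles, threshold=1.0):
    """Map each consistently changed gene to +1 (induced) or -1 (repressed)."""
    genes = set.intersection(*(set(p) for p in profiles))
    signature = {}
    for gene in sorted(genes):
        values = [p[gene] for p in profiles]
        if all(v >= threshold for v in values):
            signature[gene] = 1
        elif all(v <= -threshold for v in values):
            signature[gene] = -1
    return signature


def match_score(profile, signature, threshold=1.0):
    """Fraction of signature genes changed in the expected direction."""
    if not signature:
        return 0.0
    hits = sum(
        1 for gene, direction in signature.items()
        if direction * profile.get(gene, 0.0) >= threshold
    )
    return hits / len(signature)


# Made-up log2 fold changes for two known toxicants per class.
training = {
    "oxidant stressor": [
        {"A1": 2.5, "A2": -1.8, "B1": 0.1, "C1": -0.2},
        {"A1": 3.1, "A2": -1.2, "B1": 0.3, "C1": 0.0},
    ],
    "PAH": [
        {"A1": 0.2, "A2": 0.1, "B1": 4.0, "C1": 0.1},
        {"A1": 0.4, "A2": -0.1, "B1": 3.2, "C1": -0.3},
    ],
}
signatures = {name: derive_signature(p) for name, p in training.items()}

unknown = {"A1": 2.0, "A2": -1.5, "B1": 0.2, "C1": 0.1}
scores = {name: match_score(unknown, sig) for name, sig in signatures.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

A compound with mixed properties would score above zero against more than one signature, which is one simple way the cross talk mentioned above could show up. For large, closely related compound sets, the clustering or principal-component methods named in the text would replace this all-or-nothing rule.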

For the studies outlined in Figure 2, we developed 
the custom cDNA microarray chip ToxChip v1.0. 
Figure 1. Simplified overview of the method for sample preparation and
hybridization to cDNA microarrays. For illustrative purposes, samples derived
from cell culture are depicted, although other sample types are amenable to
this analysis. 
Figure 2. Schematic representation of the method for identification
of a toxicant's mechanism of action. In this method,
gene-expression data derived from exposure of model systems
to known toxicants are analyzed, and a set of changes
characteristic of that type of toxicant (termed the toxicant
signature) is identified. As depicted, oxidant stressors produce
consistent changes in group A genes (indicated by red and
green circles), but not group B or C genes (indicated by gray
circles). The set of gene-expression changes elicited by the
suspected toxicant is then compared with these characteristic
patterns, and a putative mechanism of action is assigned to
the unknown agent.

The 2090 human genes that comprise this subarray
were selected for their well-documented involvement
in basic cellular processes as well as their responses
to different types of toxic insult. Included
on this list are DNA replication and repair genes,
apoptosis genes, and genes responsive to PAHs and
dioxin-like compounds, peroxisome proliferators,
estrogenic compounds, and oxidant stress. Some of
the other categories of genes include transcription
factors, oncogenes, tumor suppressor genes, cyclins,
kinases, phosphatases, cell adhesion and motility
genes, and homeobox genes. Also included in this
group are 84 housekeeping genes, whose hybridization
intensity is averaged and used for signal normalization
of the other genes on the chip. To date,
very few toxicants have been shown to have appreciable
effects on the expression of these housekeeping
genes. However, this housekeeping list will be
revised if new data warrant the addition or deletion
of a particular gene. Table 1 contains a general description
of some of the different classes of genes
that comprise ToxChip v1.0.

When a toxicant signature is determined, the
genes within this signature are flagged within the
database. When uncharacterized toxicants are then
screened, the data can be quickly reformatted so that
blocks of genes representing the different signatures
are displayed [11]. This facilitates rapid, visual
interpretation of data. We are also developing
ToxChip v2.0 and chips for other model systems,
including rat, mouse, Xenopus, and yeast, for use in
toxicology studies. 
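
The housekeeping-gene normalization described above can be sketched as follows. The text states only that housekeeping intensities are averaged and used for normalization; dividing every intensity by that average is an assumption of this sketch, and the gene names and intensities are invented.

```python
from statistics import mean


def normalize(intensities, housekeeping):
    """Scale all intensities by the mean housekeeping-gene intensity."""
    reference = [intensities[g] for g in housekeeping if g in intensities]
    if not reference:
        raise ValueError("no housekeeping genes found on this array")
    scale = mean(reference)
    return {gene: value / scale for gene, value in intensities.items()}


housekeeping = ["HK1", "HK2"]  # hypothetical stand-ins for the 84 genes

# The treated channel is twice as bright overall, a purely technical effect.
control = {"HK1": 1000.0, "HK2": 3000.0, "geneX": 500.0}
treated = {"HK1": 2000.0, "HK2": 6000.0, "geneX": 4000.0}

raw_fold = treated["geneX"] / control["geneX"]
norm_fold = (normalize(treated, housekeeping)["geneX"]
             / normalize(control, housekeeping)["geneX"])
print(f"raw fold change: {raw_fold:.1f}, normalized: {norm_fold:.1f}")
```

Here the raw eightfold change shrinks to fourfold once the overall brightness difference is removed, which is why the stability of the housekeeping set under toxicant treatment matters.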

Animal Models in Toxicology Testing 

The toxicology community relies heavily on the 
use of animals as model systems for toxicology testing.
Unfortunately, these assays are inherently expensive,
require large numbers of animals, and take a 
long time to complete and analyze. Therefore, the 
National Institute of Environmental Health Sciences 
(NIEHS), the National Toxicology Program, and the 
toxicology community at large are committed to re- 
ducing the number of animals used, by developing 
more efficient and alternative testing methodologies. 
Although substantial progress has been made in the 
development of alternative methods, bioassays are 
still used for testing endpoints such as neurotoxic- 
ity, immunotoxicity, reproductive and developmen- 
tal toxicology, and genetic toxicology. The rodent 
cancer bioassay is a particularly expensive and time- 
consuming assay, as it requires almost 4 yr, 1200 
animals, and millions of dollars to execute and ana- 
lyze [43]. In vitro experiments of the type outlined 
in Figure 2 might provide evidence that an unknown 
agent is (or is not) responsible for eliciting a given 
biological response. This information would help to 
select a bioassay more specifically suited to the agent 
in question or perhaps suggest that a bioassay is not 
necessary, which would dramatically reduce cost, 
animal use, and time. 

Table 1. ToxChip v1.0: A Human cDNA Microarray 
Chip Designed to Detect Responses to Toxic Insult 

Gene category                            No. of genes on chip
Apoptosis                                 72
DNA replication and repair                99
Oxidative stress/redox homeostasis        90
Peroxisome proliferator responsive        22
Dioxin/PAH responsive                     12
Estrogen responsive                       63
Housekeeping                              84
Oncogenes and tumor suppressor genes      76
Cell-cycle control                        51
Transcription factors                    131
Kinases                                  276
Phosphatases                              88
Heat-shock proteins                       23
Receptors                                349
Cytochrome P450s                          30

*This list is intended as a general guide. The gene categories are not 
unique, and some genes are listed in multiple categories. 

The addition of microarray techniques to stan- 
dard bioassays may dramatically enhance the sen- 
sitivity and interpretability of the bioassay and 
possibly reduce its cost. Gene-expression signatures 
could be determined for various types of tissue-spe- 
cific toxicants, and new compounds could be 
screened for these characteristic signatures, provid- 
ing a rapid and sensitive in vivo test. Also, because 
gene expression is often exquisitely sensitive to low 
doses of a toxicant, the combination of gene-expres- 
sion screening and the bioassay might allow the use 
of lower toxicant doses, which are more relevant to 
human exposure levels, and the use of fewer ani- 
mals. In addition, gene-expression changes are nor- 
mally measured in hours or days, not in the months 
to years required for tumor development. Further- 
more, microarrays might be particularly useful for 
investigating the relationship between acute and 
chronic toxicity and identifying secondary effects 
of a given toxicant by studying the relationship 
between the duration of exposure to a toxicant and 
the gene-expression profile produced. Thus, a bio- 
assay that incorporates gene-expression signatures 
with traditional endpoints might be substantially 
shorter, use more realistic dose regimens, and cost 
substantially less than the current assays do. 

These considerations are also relevant for branches 
of toxicology not related to human health and not 
using rodents as model systems, such as aquatic toxi- 
cology and plant pathology. Bioassays based on the 
fathead minnow, Daphnia, and Arabidopsis could 



also be improved by the addition of microarray analy- 
sis. The combination of microarrays with traditional 
bioassays might also be useful for investigating some 
of the more intractable problems in toxicology re- 
search, such as the effects of complex mixtures and 
the difficulties in cross-species extrapolation. 

Exposure Assessment, Environmental Monitoring, 
and Drug Safety 

The currently used methods for assessment of ex- 
posure to chemical toxicants are based on measure- 
ment of tissue toxin levels or on surrogate markers 
of toxicity, termed biomarkers (e.g., peripheral blood 
levels of hepatic enzymes or DNA adducts). Because 
gene expression is a sensitive endpoint, gene expres- 
sion as measured with microarray technology may 
be useful as a new biomarker to more precisely iden- 
tify hazards and to assess exposure. Similarly, 
microarrays could be used in an environmental- 
monitoring capacity to measure the effect of poten- 
tial contaminants on the gene-expression profiles 
of resident organisms. In an analogous fashion, 
microarrays could be used to measure gene-expres- 
sion endpoints in subjects in clinical trials. The com- 
bination of these gene-expression data and more 
established toxic endpoints in these trials could be 
used to define highly precise surrogates of safety. 

Gene-expression profiles in samples from exposed 
individuals could be compared to the profiles of the 
same individuals before exposure. From this infor- 
mation, the nature of the toxic exposure can be de- 
termined or a relative clinical safety factor estimated. 
In the future it may also be possible to estimate not 
only the nature but the dose of the toxicant for a 
given exposure, based on relative gene-expression 
levels. This general approach may be particularly 
appropriate for occupational-health applications, in 
which unexposed and exposed samples from the 
same individuals may be obtainable. For example, 
a pilot study of gene expression in peripheral-blood 
lymphocytes of Polish coke-oven workers exposed 
to PAHs (and many other compounds) is under con- 
sideration at the NIEHS. An important consideration 
for these types of studies is that gene expression can 
be affected by numerous factors, including diet, 
health, and personal habits. To reduce the effects 
of these confounding factors, it may be necessary 
to compare pools of control samples with pools of 
treated samples. In the future it may be possible to 
compare exposed sample sets to a national database 
of human-expression data, thus eliminating the 
need to provide an unexposed sample from the same 
individual. Efforts to develop such a national gene- 
expression database are currently underway [44,45]. 
However, this national database approach will re- 
quire a better understanding of genome-wide gene 
expression across the highly diverse human popu- 
lation and of the effects of environmental factors 
on this expression. 
Alleles, Oligo Arrays, and Toxicogenetics 

Gene sequences vary between individuals, and 
this variability can be a causative factor in human 
diseases of environmental origin [46,47]. A new area 
of toxicology, termed toxicogenetics, was recently 
developed to study the relationship between genetic 
variability and toxicant susceptibility. This field is 
not the subject of this discussion, but it is worth- 
while to note that the ability of oligonucleotide ar- 
rays to discriminate DNA molecules based on single 
base-pair differences makes these arrays uniquely 
useful for this type of analysis. Recent reports dem- 
onstrated the feasibility of this approach [41,42]. 
The NIEHS has initiated the Environmental Genome 
Project to identify common sequence polymor- 
phisms in 200 genes thought to be involved in en- 
vironmental diseases [48]. In a pilot study on the 
feasibility of this application to the Environmental 
Genome Project, oligonucleotide arrays will be used 
to resequence 20 candidate genes. This toxicogenetic 
approach promises to dramatically improve our un- 
derstanding of interindividual variability in disease 
susceptibility. 

FUTURE PRIORITIES 

There are many issues that must be addressed be- 
fore the full potential of microarrays in toxicology 
research can be realized. Among these are model sys- 
tem selection, dose selection, and the temporal na- 
ture of gene expression. In other words, in which 
species, at what dose, and at what time do we look 
for toxicant-induced gene expression? If human 
samples are analyzed, how variable is global gene 
expression between individuals, before and after toxi- 
cant exposure? What are the effects of age, diet, and 
other factors on this expression? Experience, in the 
form of large data sets of toxicant exposures, will 
answer these questions. 

One of the most pressing issues for array scientists 
is the construction of a national public database 
(linked to the existing public databases) to serve as a 
repository for gene-expression data. This relational 
database must be made available for public use, and 
researchers must be encouraged to submit their ex- 
pression data so that others may view and query the 
information. Researchers at the National Institutes 
of Health have made laudable progress in developing
the first generation of such a database [44,45]. In 
addition, improved statistical methods for gene clus- 
tering and pattern recognition are needed to ana- 
lyze the data in such a public database. 

The proliferation of different platforms and meth- 
ods for microarray hybridizations will improve 
sample handling and data collection and analysis and 
reduce costs. However, the variety of microarray 
methods available will create problems of data com- 
patibility between platforms. In addition, the near- 
infinite variety of experimental conditions under 



which data will be collected by different laborato- 
ries will make large-scale data analysis extremely dif- 
ficult. To help circumvent these future problems, a 
set of standards to be included on all platforms 
should be established. These standards would facili- 
tate data entry into the national database and serve 
as reference points for cross-platform and inter-labo- 
ratory data analysis. 

Many issues remain to be resolved, but it is clear 
that new molecular techniques such as microarray 
hybridization will have a dramatic impact on toxicol- 
ogy research. In the future, the information gathered 
from microarray-based hybridization experiments will 
form the basis for an improved method to assess the 
impact of chemicals on human and environmental 
health. 
Abstract 

Recent progress in genomics and proteomics technologies has created a unique opportunity to significantly impact 
the pharmaceutical drug development processes. The perception that cells and whole organisms express specific 
inducible responses to stimuli such as drug treatment implies that unique expression patterns, molecular fingerprints, 
indicative of a drug's efficacy and potential toxicity are accessible. The integration into state-of-the-art toxicology of 
assays allowing one to profile treatment-related changes in gene expression patterns promises new insights into 
mechanisms of drug action and toxicity. The benefits will be improved lead selection, and optimized monitoring of 
drug efficacy and safety in pre-clinical and clinical studies based on biologically relevant tissue and surrogate markers. 
© 2000 Elsevier Science Ireland Ltd. All rights reserved. 
1. Introduction 

The majority of drugs act by binding to protein 
targets, most to known proteins representing en- 
zymes, receptors and channels, resulting in effects 
such as enzyme inhibition and impairment of 
signal transduction. The treatment-induced per- 
turbations provoke feedback reactions aiming to 
compensate for the stimulus, which almost always 
are associated with signals to the nucleus, result- 
ing in altered gene expression. Such gene expres- 
sion regulations account for both the 
pharmacological action and the toxicity of a drug 
and can be visualized by either global mRNA or 
global protein expression profiling. Hence, for 
each individual drug, a characteristic gene regula- 
tion pattern, its molecular fingerprint, exists 
which bears valuable information on its mode of 
action and its mechanism of toxicity. 

Gene expression is a multistep process that 
results in an active protein (Fig. 1). There exist 
numerous regulation systems that exert control at 
and after the transcription and the translation 
step. Genomics, by definition, encompasses the 
quantitative analysis of transcripts at the mRNA 
level, while the aim of proteomics is to quantify 
gene expression further down-stream, creating a 
snapshot of gene regulation closer to ultimate cell 
function control. 
2. Global mRNA profiling 

Expression data at the mRNA level can be 
produced using a set of different technologies 
such as DNA microarrays, reverse transcript 
imaging, amplified fragment length polymorphism 
(AFLP), serial analysis of gene expression 
(SAGE) and others. Currently, DNA microarrays 
are very popular and hold great promise. 
On a typical array, each gene of interest is represented
either by a long DNA fragment (200-2400 
bp) typically generated by polymerase chain reaction
(PCR) and spotted on a suitable substrate 
using robotics (Schena et al., 1995; Shalon et al., 
1996) or by several short oligonucleotides (20-30 
bp) synthesized directly onto a solid support using 
photolabile nucleotide chemistry (Fodor et al., 
1991; Chee et al., 1996). From control and treated 
tissues, total RNA or mRNA is isolated and 
reverse transcribed in the presence of radioactive 
or fluorescently labeled nucleotides, and the labeled 
probes are then hybridized to the arrays. The 
intensity of the array signal is measured for each 
gene transcript by either autoradiography or laser 
scanning confocal microscopy. The ratio between 
the signals of control and treated samples reflects 
the relative drug-induced change in transcript 
abundance. 



3. Global protein profiling 

Global quantitative expression analysis at the 
protein level is currently restricted to the use of 
two-dimensional gel electrophoresis. This tech- 
nique combines separation of tissue proteins by 
isoelectric focusing in the first dimension and by 
sodium dodecyl sulfate slab gel electrophoresis- 
based molecular weight separation on the second, 
orthogonal dimension (Anderson et al., 1991). 
The product is a rectangular pattern of protein 
spots that are typically revealed by Coomassie 
Blue, silver or fluorescent staining (Fig. 2). 
Protein spots are identified by mass spectrometry 
following generation of peptide mass fingerprints 
(Mann et al., 1993) and sequence tags (Wilkins et 
al., 1996). As in the mRNA approach, the 
optical densities of spots from 
control and treated samples are compared to 
search for treatment-related changes. 



4. Expression data analysis 

Bioinformatics forms a key element required to 
organize, analyze and store expression data from 
either source, the mRNA or the protein level. The 
overall objective, once a mass of high-quality 
quantitative expression data has been collected, is 
to visualize complex patterns of gene expression 
changes, to detect pathways and sets of genes 
tightly correlated with treatment efficacy and toxicity,
and to compare the effects of different sets of 
treatment (Anderson et al., 1996). As the drug-effect
database grows, one may detect similarities
and differences between the molecular fingerprints
produced by various drugs, information 
that may be crucial in deciding whether to 
refocus or extend the therapeutic spectrum of a 
drug candidate. 

Fig. 1. Production of an active protein is a multistep process in which numerous regulation systems exert control at various stages 
of expression. Molecular fingerprints of drugs can be visualized through expression profiling at the mRNA level (genomics) using 
a variety of technologies and at the protein level (proteomics) using two-dimensional gel electrophoresis. 

Fig. 2. Computerized representation of a Coomassie Blue stained two-dimensional gel electrophoresis pattern of Fischer F344 rat 
liver homogenate. 



5. Comparison of global mRNA and protein 
expression profiling 

There are several synergies and overlaps of data 
obtained by mRNA and protein expression analysis.
Low-abundance transcripts may not be easily 
quantified at the protein level using standard two- 
dimensional gel electrophoresis analysis and their 
detection may require prefractionation of sam- 
ples. The expression of such genes may be prefer- 
ably quantified at the mRNA level using 
techniques allowing PCR-mediated target amplification.
Tissue biopsy samples typically yield good 
quality of both mRNA and proteins; however, the 
quality of mRNA isolated from body fluids is 
often poor due to the faster degradation of 
mRNA when compared with proteins. RNA sam- 
ples from body fluids such as serum or urine are 
often not very 'meaningful', and secreted proteins 
are likely more reliable surrogate markers for 
treatment efficacy and safety. Detection of post-translational
modifications, events often related to 
function or nonfunction of a protein, is restricted 
to protein expression analysis and rarely can be 
predicted by mRNA profiling. Information on 
subcellular localization and translocation of 
proteins has to be acquired at the level of the 
protein in combination with sample prefractiona- 
tion procedures. The growing evidence of a poor 
correlation between mRNA and protein abun- 
dance (Anderson and Seilhamer, 1997) further 
suggests that the two approaches, mRNA and 
protein profiling, are complementary and should 
be applied in parallel. 
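
One simple way to quantify how well mRNA and protein abundance agree is a Pearson correlation over genes measured at both levels. The sketch below uses invented log2 abundances purely for illustration; it does not reproduce any data from the cited study, and the variable names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up log2 abundances for the same six genes at each level.
mrna_log2 = [5.1, 7.3, 6.0, 9.2, 4.4, 8.1]
protein_log2 = [6.8, 6.1, 8.9, 7.5, 5.0, 6.2]

r = pearson(mrna_log2, protein_log2)
print(f"mRNA vs protein: r = {r:.2f}")
```

A coefficient well below 1 on real paired measurements would support the argument that neither level can stand in for the other.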



6. Expression profiling and drug development 

Understanding the mechanisms of action and 
toxicity, and being able to monitor treatment 
efficacy and safety during trials are crucial for the 
successful development of a drug. Mechanistic 
insights are essential for the interpretation of drug 
effects and enhance the chances of recognizing 
potential species specificities contributing to an 
improved risk profile in humans (Richardson et 
al., 1993; Steiner et al., 1996b; Aicher et al., 1998). 
The value of expression profiling further increases 
when links between treatment-induced expression 
profiles and specific pharmacological and toxic 
endpoints are established (Anderson et al., 1991, 
1995, 1996; Steiner et al., 1996a). Changes in gene
expression are known to precede the manifesta- 
tion of morphological alterations, giving expres- 
sion profiling a great potential for early 
compound screening, enabling one to select drug 
candidates with wide therapeutic windows 
reflected by molecular fingerprints indicative of 
high pharmacological potency and low toxicity 
(Arce et al., 1998). In later phases of drug
development, surrogate markers of treatment efficacy
and toxicity can be applied to optimize the moni- 
toring of pre-clinical and clinical studies (Doherty 
et al., 1998). 



7. Perspectives 

The basic methodology of safety evaluation has 
changed little during the past decades. Toxicity in 
laboratory animals has been evaluated primarily 
by using hematological, clinical chemistry and 
histological parameters as indicators of organ 
damage. The rapid progress in genomics and
proteomics technologies creates a unique opportunity
to dramatically improve the predictive power of 
safety assessment and to accelerate the drug devel- 
opment process. Application of gene and protein 
expression profiling promises to improve lead se- 
lection, resulting in the development of drug can- 
didates with higher efficacy and lower toxicity. 
The identification of biologically relevant surro- 
gate markers correlated with treatment efficacy 
and safety bears a great potential to optimize the 
monitoring of pre-clinical and clinical trials.

Subject: RE: [Fwd: Toxicology Chip]
Date: Mon, 3 Jul 2000 08:09:45 -0400
From: "Afshari.Cynthia" <afshari@niehs.nih.gov>
To: "Diana Hamlet-Cox" <dianahc@incyte.com>



You can see the list of clones that we have on our 12K chip at
http://manuel.niehs.nih.gov/maps/guest/clonesrch.cfm

We selected a subset of genes (2000K) that we believed critical to tox
response and basic cellular processes and added a set of clones and [illegible]
this. We have included a set of control genes (80+) that were selected by
the NHGRI because they did not change across a large set of array
experiments. However, we have found that some of these genes change
significantly after tox treatments and are in the process of looking at the
variation of each of these 80+ genes across our experiments.
Our chips are constantly changing and being updated and we hope that our
data will lead us to what the toxchip should really be.
I hope this answers your question.
Cindy Afshari 



> From: Diana Hamlet-Cox
> Sent: Monday, June 26, 2000 8:52 PM
> To: afshari@niehs.nih.gov
> Subject: [Fwd: Toxicology Chip]
>
> Dear Dr. Afshari,
>
> Since I have not yet had a response from Bill Grigg, perhaps he was not
> the right person to contact.
>
> Can you help me in this matter? I don't need to know the sequences
> necessarily, but I would like very much to know what types of sequences
> are being used, e.g., GPCRs (more specific?), ion channels, etc.
>
> Diana Hamlet-Cox
>
> Original Message
> Subject: Toxicology Chip
> Date: Mon, 19 Jun 2000 18:31:48 -0700
> From: Diana Hamlet-Cox <dianahc@incyte.com>
> Organization: Incyte Pharmaceuticals
> To: grigg@niehs.nih.gov
>
> Dear Colleague:
>
> I am doing literature research on the use of expressed genes as
> pharmacotoxicology markers, and found the Press Release dated February
> 29, 2000 regarding the work of the NIEHS in this area. I would like to
> know if there is a resource I can access (or you could provide?) that
> would give me a list of the 12,000 genes that are on your Human ToxChip
> Microarray. In particular, I am interested in the criteria used to
> select sequences for the ToxChip, including any control sequences
> included in the microarray.
>
> Thank you for your assistance in this request.
>
> Diana Hamlet-Cox, Ph.D.
> Incyte Genomics, Inc.